Abstract


The  challenge  this  thesis  addresses  is  to  speed  up  the  development 
of  concurrent  programs  by  increasing  the  efficiency  with  which  concurrent 
programs  can  be  tested  and  consequently  evolved.  The  goal  of  this  thesis  is  to 
generate  methods  and  tools  that  help  software  engineers  increase  confidence 
in  the  correct  operation  of  their  programs.  To  achieve  this  goal,  this  thesis 
advocates  testing  of  concurrent  software  using  a  systematic  approach  capable 
of  enumerating  possible  executions  of  a  concurrent  program. 

The  practicality  of  the  systematic  testing  approach  is  demonstrated  by 
presenting  a  novel  software  infrastructure  that  repeatedly  executes  a  program 
test,  controlling  the  order  in  which  concurrent  events  happen  so  that  different 
behaviors  can  be  explored  across  different  test  executions.  By  doing  so, 
systematic  testing  circumvents  the  limitations  of  traditional  ad-hoc  testing, 
which  relies  on  chance  to  discover  concurrency  errors. 

However,  the  idea  of  systematic  testing  alone  does  not  quite  solve  the 
problem  of  concurrent  software  testing.  The  combinatorial  nature  of  the 
number  of  ways  in  which  concurrent  events  of  a  program  can  execute  causes 
an  explosion  of  the  number  of  possible  interleavings  of  these  events,  a 
problem  referred  to  as  state  space  explosion. 

To address the state space explosion problem, this thesis studies techniques
for quantifying the extent of state space explosion and explores
several  directions  for  mitigating  state  space  explosion:  parallel  state  space 
exploration,  restricted  runtime  scheduling,  and  abstraction  reduction.  In 
the  course  of  its  research  exploration,  this  thesis  pushes  the  practical  limits 
of  systematic  testing  by  orders  of  magnitude,  scaling  systematic  testing  to 
real-world  programs  of  unprecedented  complexity. 
Chapter 1
Introduction 


1.1  Problem  Motivation  and  Scope 

The  sum  of  the  computational  power  of  the  500  most  powerful  computers  in  the  world 
has been doubling every 15 months for the past 20 years, reaching 223 Petaflop/s in 2013
(Figure  1.1).  Notably,  this  trend  has  continued  in  spite  of  the  silicon  industry  reaching 
the  CPU  frequency  wall,  an  obstacle  that  has  been  overcome  by  massive  parallelization 
and  distribution  of  computation:  from  single  processors  to  symmetric  multiprocessing 
(SMP)  to  massive  parallel  processing  (MPP)  to  cluster  computers  (Figure  1.2).  To 
leverage  the  computational  power  of  such  architectures,  computer  programs  have 
undergone  a  similar  transition:  from  sequential  to  concurrent  computation. 




Figure  1.1:  Evolution  of  Top500  Performance  [102] 

Unfortunately,  the  transition  from  sequential  to  concurrent  programs  unleashed 
not  only  the  power  of  concurrent  processing  but  also  the  terror  of  concurrency  errors. 
Unlike sequential programs, whose behavior is determined by their inputs, the behavior
of concurrent programs can additionally depend on the relative speed of concurrently
executing  threads.  The  concurrent  nature  of  the  computation  leads  to  a  combinatorial 
explosion  of  the  number  of  ways  in  which  concurrent  events  of  a  program  can  execute. 
Figure 1.2: Evolution of Top500 Architectures [102]


This  has  proven  a  challenge  for  software  engineers,  who  may  overlook  concurrency 
errors  as  they  fail  to  fathom  all  scenarios  in  which  their  program  can  possibly  execute. 

Although  software  testing  research  has  made  great  leaps  forward,  making  use  of 
techniques  such  as  symbolic  execution  [83],  it  has  struggled  to  generate  methods  and 
tools  that  would  help  software  engineers  build  increasingly  complex  programs  that 
keep  up  with  the  performance  trends  of  new  computing  architectures.  In  particular, 
most  of  the  existing  concurrent  software  is  still  tested  using  ad-hoc  methods,  collectively 
referred  to  as  stress  testing  [56],  which  rely  on  chance  when  it  comes  to  discovering 
concurrency  errors.  Because  of  the  absence  of  a  high  confidence  testing  mechanism, 
software  engineers  are  reluctant  to  make  changes  to  stable  versions  of  their  programs, 
slowing  down  the  rate  at  which  software  evolves. 

The  challenge  this  thesis  addresses  is  to  speed  up  the  development  of  concurrent 
programs  by  increasing  the  efficiency  with  which  concurrent  programs  can  be  tested  and 
consequently  evolved.  The  goal  of  this  thesis  is  to  generate  methods  and  tools  that  help 
software  engineers  increase  confidence  in  the  correct  operation  of  their  programs.  To 
achieve  this  goal,  this  thesis  advocates  testing  of  concurrent  software  using  a  systematic 
approach  capable  of  enumerating  possible  executions  of  a  concurrent  program. 

1.2 Thesis Statement

"Existing concurrent software can be tested in a systematic and scalable fashion using testing
infrastructure  which  controls  nondeterminism  and  mitigates  state  space  explosion." 

On  a  theoretical  level,  given  a  program,  one  can  view  its  set  of  possible  behaviors 
as  a  random  variable  X  and  ad-hoc  testing  as  a  method  for  sampling  observations 
from  the  probabilistic  distribution  of  X.  In  case  there  exists  an  erroneous  program 
behavior  and  ad-hoc  testing  executes  concurrent  events  in  a  way  that  reveals  the  error 
with  probability  p,  then,  assuming  independence  of  individual  samples,  ad-hoc  testing 
is expected to discover the erroneous behavior using 1/p samples. In other words,
ad-hoc  testing  is  good  at  detecting  likely  errors,  but  ineffective  at  discovering  corner 
case  errors  that  occur  with  very  low  probability.  Further,  if  the  probability  distribution 
of  the  random  variable  X  is  uniform,  then  ad-hoc  testing  is  expected  to  encounter 
all behaviors using O(n log n) samples [22], where n represents the total number of
possible  program  behaviors.  Unfortunately,  probabilistic  distributions  of  real-world 
program  behaviors  are  highly  non-uniform,  rendering  ad-hoc  testing  an  inefficient 
mechanism  for  exploring  the  state  space  of  possible  program  behaviors. 
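
The two sample-count estimates above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the thesis infrastructure: the function names are hypothetical, and the second estimate assumes the uniform distribution stated in the text, for which the exact expectation is n times the n-th harmonic number (the classic coupon collector result).

```python
from fractions import Fraction


def expected_samples_to_hit(p):
    """Expected number of independent samples until an event of probability p occurs."""
    # Mean of a geometric distribution with success probability p.
    return 1 / Fraction(p)


def expected_samples_to_cover(n):
    """Expected number of uniform samples needed to observe all n behaviors."""
    # Coupon collector: n * (1 + 1/2 + ... + 1/n), which grows as n log n.
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


if __name__ == "__main__":
    # An error that shows up once in a million runs needs about a million samples.
    print(expected_samples_to_hit(Fraction(1, 1_000_000)))
    # Covering n equally likely behaviors needs noticeably more than n samples.
    for n in (2, 10, 100):
        print(n, float(expected_samples_to_cover(n)))
```

Exact fractions are used so that the small cases can be verified by hand; for non-uniform distributions the cover time is dominated by the least likely behavior and can be far larger.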

This  thesis  focuses  on  an  alternative  approach  to  testing  of  concurrent  programs, 
which  avoids  the  aforementioned  shortcomings  of  ad-hoc  testing.  This  approach, 
known  as  systematic  testing,  controls  the  order  in  which  concurrent  events  of  a  program 
happen  in  order  to  guarantee  that  different  executions  of  the  same  test  explore  different 
interleavings  of  concurrent  events.  By  doing  so,  systematic  testing  circumvents  the 
limitations  of  ad-hoc  testing.  In  particular,  unlike  ad-hoc  testing,  systematic  testing  is 
guaranteed  to  encounter  all  interleavings  of  concurrent  events  using  n  samples,  where 
n represents the total number of possible interleavings of concurrent events.

However,  reducing  the  number  of  samples  required  to  enumerate  all  interleavings 
of  concurrent  events  down  to  its  theoretical  minimum  does  not  quite  solve  the  problem 
of  concurrent  software  testing.  The  combinatorial  nature  of  the  number  of  ways  in 
which  concurrent  events  of  a  program  can  execute  causes  an  explosion  of  the  number 
of  possible  interleavings  of  these  events,  a  problem  referred  to  as  state  space  explosion. 
For  a  typical  real-world  program,  the  number  of  all  possible  interleavings  of  concurrent 
events can exceed the estimated number of atoms in the universe (10^78 to 10^82), rendering
exhaustive  enumeration  of  all  interleavings  impossible. 
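
To give a feel for how quickly the number of interleavings grows, the sketch below counts the orderings of events for a set of threads. It rests on simplifying assumptions that are not taken from the thesis: every event is atomic, threads never block, and every ordering that respects per-thread program order is feasible, so the count is a multinomial coefficient. The function name is hypothetical.

```python
from math import factorial


def count_interleavings(events_per_thread):
    """Number of ways to interleave threads whose event counts are given.

    Assumes atomic events and no synchronization, so the result is the
    multinomial coefficient (k1 + ... + km)! / (k1! * ... * km!).
    """
    total = factorial(sum(events_per_thread))
    for k in events_per_thread:
        total //= factorial(k)
    return total


if __name__ == "__main__":
    print(count_interleavings([2, 2]))       # two threads, two events each
    print(count_interleavings([10] * 10))    # ten threads, ten events each
```

Under these assumptions, ten threads with ten events each already yield more interleavings than the upper estimate of atoms in the universe quoted above.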

To  address  the  state  space  explosion  problem  in  the  context  of  concurrent  software 
testing,  this  thesis  first  studies  techniques  for  quantifying  the  extent  of  state  space 
explosion  and  then  explores  several  directions  for  mitigating  state  space  explosion: 
parallel  state  space  exploration,  restricted  runtime  scheduling,  and  abstraction  reduction. 
In  the  course  of  its  research  exploration,  this  thesis  pushes  the  practical  limits  of 
systematic  testing  by  orders  of  magnitude,  scaling  systematic  testing  to  real-world 
programs  of  unprecedented  complexity. 


1.3  Thesis  Contributions 

This  thesis  makes  several  contributions.  First,  this  thesis  presents  a  novel  software 
infrastructure  for  systematic  testing  of  real-world  programs  and  uses  this  infrastructure 
as  a  vehicle  to  drive  forward  its  research  on  systematic  testing.  The  infrastructure  enables 
systematic  testing  of  unmodified  binaries  of  concurrent  programs  and  implements  state 
of  the  art  algorithms  for  mitigating  state  space  explosion. 

Second,  this  thesis  pioneers  research  on  estimating  the  extent  of  state  space  explosion, 
a  problem  referred  to  as  state  space  estimation.  In  particular,  this  thesis  presents  a  number 
of  techniques  that  create  and  refine  an  estimate  of  the  number  of  thread  interleavings 
as the space of all possible thread interleavings is being explored. Further, this thesis
demonstrates  the  benefits  of  state  space  estimation  by  using  it  to  implement  intelligent 
policies  for  allocation  of  scarce  resources  to  a  collection  of  systematic  tests. 

Third,  this  thesis  demonstrates  how  to  extend  an  inherently  sequential  state  of  the 
art  algorithm  for  exploration  of  the  space  of  all  possible  thread  interleavings  [51]  to 
parallel  execution  on  a  large  computational  cluster.  This  effort  results  in  a  scalable 
version  of  the  algorithm  that  achieves  strong  scaling  on  computer  clusters  consisting 
of up to one thousand machines, speeding up the exploration of a set of thread
interleavings  by  several  orders  of  magnitude. 

Fourth,  this  thesis  explores  integration  of  systematic  testing  with  techniques  for 
restricted  runtime  scheduling.  In  particular,  this  thesis  presents  an  eco-system  created  by 
integrating its software infrastructure with a novel deterministic and stable multithreading
runtime [39], which reduces the number of possible thread interleavings.
There is synergy between systematic testing and restricted runtime scheduling: systematic
testing helps to check the set of thread interleavings that are allowed by the restricted
scheduler,  while  the  restricted  scheduler  reduces  the  number  of  thread  interleavings 
systematic  testing  needs  to  check. 

Fifth,  this  thesis  demonstrates  how  modeling  program  behavior  at  higher  levels  of 
abstraction, referred to as abstraction reduction, mitigates the extent of state space explosion.
In general, the granularity of events interleaved by a thread interleaving can range
from  machine  instructions  to  high-level  function  calls.  Abstraction  reduction  mimics  the 
behavior  of  software  engineers  that  trust  implementations  of  common  coordination  and 
communication  libraries,  such  as  POSIX  threads  (pthreads)  [120]  or  Message  Passing 
Interface  (MPI)  [103].  In  particular,  abstraction  reduction  treats  common  coordination 
and  communication  events  as  atomic,  shifting  the  focus  of  systematic  testing  from 
testing  the  internals  of  common  coordination  and  communication  libraries  to  testing 
the  programs  that  use  them. 

1.4  Thesis  Organization 

The  rest  of  this  thesis  is  organized  as  follows.  Chapter  2  covers  both  the  theoretical  and 
practical foundations this thesis builds upon. Chapter 3 presents the software infrastructure
developed in support of the research carried out by this thesis. Chapter 4 presents
the  design,  implementation,  and  evaluation  of  mechanisms  for  state  space  estimation 
and  resource  allocation  policies  based  on  state  space  estimates.  Chapter  5  presents  the 
design,  implementation,  and  evaluation  of  an  algorithm  for  parallel  exploration  of  a  set 
of  thread  interleavings.  Chapter  6  presents  the  design,  implementation,  and  evaluation 
of  an  eco-system  formed  by  integrating  our  systematic  testing  infrastructure  with  a 
restricted  runtime  scheduling  system.  Chapter  7  presents  the  design,  implementation, 
and evaluation of an approach that leverages abstraction to mitigate state space explosion.
Chapter 8 provides an overview of related work and Chapter 9 discusses future
work  and  draws  conclusions. 
Chapter 2

Systematic  Testing  Background 


This  chapter  first  describes  state  of  the  art  algorithms  for  systematic  testing  (§2.1), 
focusing  on  algorithms  pertinent  to  systematic  testing  of  concurrent  programs,  and 
then  presents  an  overview  of  the  evolution  of  tools  for  systematic  testing  of  concurrent 
programs  (§2.2). 


2.1  Theory 

To  gain  confidence  in  the  correct  operation  of  a  concurrent  program  through  testing, 
a  software  engineer  must  verify  as  many  program  behaviors  as  possible  in  the  time 
allotted  for  testing.  The  first  step  towards  effective  testing  of  a  concurrent  program  is 
the  ability  to  evaluate  different  scenarios  in  which  concurrent  events  of  the  program 
may  occur.  The  next  step  is  then  achieving  efficiency  through  automation  of  the  testing 
process,  which  should  avoid  evaluation  of  redundant  or  infeasible  outcomes  and  at  the 
same  time  fully  utilize  the  available  computational  resources. 

Traditional  approaches  to  testing  concurrent  programs  fail  to  meet  these  criteria. 
Ad-hoc  approaches  such  as  stress  testing  [56]  evaluate  a  program  using  a  range  of  tests. 
The  intent  of  a  stress  test,  typically  either  a  large  collection  of  concurrent  stimuli  or  a 
small  one  repeated  many  times,  is  to  drive  a  program  into  as  many  different  scenarios 
as  possible.  However,  stress  testing  is  not  guaranteed  to  exercise  all  possible  scenarios. 

To  achieve  better  coverage,  researchers  have  developed  and  evolved  exhaustive 
approaches  such  as  theorem  proving  [31,  43,  49,  95,  125]  or  model  checking  [14,  33, 
35, 72]. Although these approaches have the potential to provide guarantees about all
possible  scenarios,  their  practicality  is  limited  by  the  false  positives  and  false  negatives 
generally  introduced  during  modeling  [9],  the  expert  knowledge  needed  to  prove 
theorems  about  the  program  [84],  or  the  effort  needed  to  create  and  verify  accurate 
models  of  existing  software  and  update  these  models  as  the  software  changes  [24]. 

In parallel with the above exhaustive approaches, research in software verification
[61, 71] has produced practical tools for systematic testing of concurrent software.
Unlike stress testing, systematic testing is able to enumerate different scenarios and
unlike the above exhaustive approaches, systematic testing operates over the actual
implementation,  striking  a  balance  between  coverage  and  practicality. 

The  idea  behind  systematic  testing  is  to  systematically  resolve  nondeterminism, 
enumerating  different  scenarios  the  program  under  test  could  experience.  The  sources 
of  nondeterminism  can  be  divided  into  two  broad  categories:  scheduling  nondeterminism 
and  input  nondeterminism. 

Scheduling  nondeterminism  arises  in  concurrent  programs  when  the  relative  speed 
of  concurrently  executing  threads  of  computation  can  affect  the  program  behavior,  while 
input  nondeterminism  arises  in  both  sequential  and  concurrent  programs  when  program 
behavior  depends  on  program  inputs.  Systematic  testing  of  scheduling  nondeterminism 
and  input  nondeterminism  are  two  orthogonal  problems. 

This  thesis  is  concerned  with  systematic  testing  of  scheduling  nondeterminism  and 
the  theory  presented  in  the  remainder  of  this  section  assumes  that  scheduling  is  the  only 
source  of  nondeterminism.  Obviously,  this  is  rarely  the  case  for  real-world  programs 
whose  behavior  can  depend  on  a  plethora  of  inputs  such  as  user  data,  file  contents,  or 
timing.  Chapter  3  describes  how  our  infrastructure  for  systematic  testing  of  concurrent 
programs  addresses  input  nondeterminism. 

The  remainder  of  this  section  gives  an  overview  of  stateful  exploration  [71,  105], 
stateless  exploration  [61],  partial  order  reduction  (POR)  [60],  and  dynamic  partial  order 
reduction  (DPOR)  [51],  state  of  the  art  algorithms  for  systematic  testing  of  concurrent 
programs. 

2.1.1  Stateful  Exploration 

Stateful exploration [12, 71, 105] is a technique that targets systematic testing of
nondeterministic programs. The goal of stateful exploration is to explore the state space of
possible  program  states  by  systematically  resolving  program  nondeterminism,  explicitly 
recording  program  states  to  avoid  re-visitation  and  to  guarantee  completeness. 

Algorithm  1  gives  a  high-level  overview  of  stateful  exploration.  To  keep  track  of  the 
exploration  progress,  stateful  exploration  maintains  two  collections  of  program  states. 
The  visited  collection  records  program  states  that  have  been  visited,  while  the  reachable 
collection  records  program  states  that  are  known  to  be  reachable. 

The  program  states  contained  in  the  visited  and  reachable  collections  are  stored 
explicitly:  for  traditional  hardware  architectures,  a  program  state  thus  consists  of 
the  relevant  context  stored  in  the  hardware  registers,  main  memory,  and  on  disk. 
To  navigate  a  program  to  a  particular  program  state  (Algorithm  1,  Line  6),  stateful 
exploration  simply  loads  the  previously  stored  context. 

The  Children  (node)  function  (Algorithm  1,  Line  7)  resumes  execution  from  the 
program  state  node.  Once  a  nondeterministic  choice  is  encountered,  a  set  of  program 
states  corresponding  to  all  possible  outcomes  of  the  choice  is  returned. 

Algorithm 1 StatefulExploration(root)

Require: root is the initial program state.
Ensure: All program states reachable from root are explored.
 1: visited ← NewSet
 2: reachable ← NewSet
 3: Insert(root, reachable)
 4: while visited ≠ reachable do
 5:     node ← an arbitrary element of reachable \ visited
 6:     navigate execution to node
 7:     for all child ∈ Children(node) do
 8:         if child ∉ reachable then
 9:             Insert(child, reachable)
10:         end if
11:     end for
12:     Insert(node, visited)
13: end while

The main drawback of stateful exploration is that for real-world programs, explicitly
storing the program states can require considerable space [29, 105, 165], which limits the
size of the state space that can be fully explored. To mitigate this problem, researchers
have devised a number of schemes such as compression and hashing [71], selective
caching [13], or parallel processing [27].
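
As an illustration of Algorithm 1, the following minimal sketch applies stateful exploration to a toy program that is an assumption of this example, not a program studied in the thesis: two threads each execute `tmp = x; x = tmp + 1` on a shared variable `x`. A program state is stored explicitly as a tuple of program counters, thread-local values, and the shared value; the helper names are hypothetical.

```python
def children(state):
    """Successor states of the toy program after one step of either thread."""
    pcs, tmps, x = state
    successors = set()
    for t in (0, 1):
        if pcs[t] == 2:              # thread t has finished
            continue
        new_pcs, new_tmps, new_x = list(pcs), list(tmps), x
        if pcs[t] == 0:
            new_tmps[t] = x          # load the shared variable
        else:
            new_x = tmps[t] + 1      # store the incremented value
        new_pcs[t] += 1
        successors.add((tuple(new_pcs), tuple(new_tmps), new_x))
    return successors


def stateful_exploration(root):
    """Explore all states reachable from root, recording each state explicitly."""
    visited = set()
    reachable = {root}
    while visited != reachable:
        node = next(iter(reachable - visited))
        # Navigating to node is trivial here because the full state is stored.
        for child in children(node):
            if child not in reachable:
                reachable.add(child)
        visited.add(node)
    return visited


if __name__ == "__main__":
    states = stateful_exploration(((0, 0), (0, 0), 0))
    finals = {x for pcs, _, x in states if pcs == (2, 2)}
    print(len(states), sorted(finals))
```

Because states are compared by value, two different interleavings that reach the same state are explored only once; the price is that every state must be kept in memory.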

In spite of the advances in stateful exploration, the majority of practical tools for
systematic testing of concurrent programs [61, 82, 107, 160, 163] are based on an
alternative approach called stateless exploration.

2.1.2  Stateless  Exploration 

Stateless exploration [61] is a systematic testing technique that is particularly
suitable for testing concurrent programs. Although stateless exploration
pursues  the  same  goal  as  stateful  exploration,  stateless  exploration  does  not  store 
program  states  explicitly.  Instead,  stateless  exploration  represents  a  program  state 
implicitly  using  the  sequence  of  nondeterministic  choices  that  lead  to  the  program  state 
from  the  initial  program  state. 

To  keep  track  of  the  exploration  progress,  stateless  exploration  abstractly  represents 
the  state  space  of  different  program  states  using  an  execution  tree.  Nodes  of  the  execution 
tree  represent  nondeterministic  choices  and  edges  of  the  execution  tree  represent 
program  state  transitions.  A  path  from  the  root  of  the  tree  to  a  leaf  then  uniquely 
encodes  a  program  execution  as  a  sequence  of  nondeterministic  choices. 

Abstractly,  enumerating  the  branches  of  the  execution  tree  corresponds  to  an  enu¬ 
meration  of  different  sequences  of  program  state  transitions.  Notably,  the  set  of  explored 
branches  of  a  partially  explored  execution  tree  identifies  which  sequences  of  program 
state  transitions  have  been  explored.  Further,  the  information  collected  by  past  execu¬ 
tions  can  be  used  to  generate  schedules  that  describe  in  what  order  to  sequence  program 
state  transitions  of  future  executions  in  order  to  explore  new  parts  of  the  state  space. 
Algorithm 2 StatelessExploration(root)

Require: root is the initial program state.
Ensure: All program states reachable from root are explored.
 1: frontier ← NewStack
 2: Push({root}, frontier)
 3: while frontier not empty do
 4:     while Top(frontier) not empty do
 5:         node ← an arbitrary element of Top(frontier)
 6:         remove node from Top(frontier)
 7:         navigate execution to node
 8:         if Children(node) not empty then
 9:             Push(Children(node), frontier)
10:         end if
11:     end while
12:     Pop(frontier)
13: end while


Algorithm  2  gives  a  high-level  overview  of  stateless  exploration.  The  algorithm 
maintains  an  exploration  frontier,  represented  as  a  stack  of  sets  of  nodes,  and  uses 
depth-first  search  to  explore  the  execution  tree.  Depth-first  search  is  used  to  achieve 
space  complexity  that  is  linear  in  the  depth  of  the  execution  tree. 

To  navigate  a  program  to  a  particular  program  state  (Algorithm  2,  Line  7),  stateless 
exploration  recreates  the  initial  program  state  and  then  carries  out  the  sequence  of 
nondeterministic  choices  identified  by  the  execution  tree.  The  Children  (node)  function 
(Algorithm  2,  Line  8)  behaves  the  same  as  before. 

The  main  drawback  of  stateless  exploration  is  that  different  sequences  of  nondeter¬ 
ministic  choices  can  represent  identical  concrete  program  states,  resulting  in  redundant 
exploration.  To  address  this  problem,  stateless  exploration  is  typically  augmented  with 
state  space  reduction  techniques  [51,  60]. 
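
The following minimal sketch mirrors Algorithm 2 on a toy program that is an assumption of this example: two threads each execute `tmp = x; x = tmp + 1` on a shared variable `x`. A node of the execution tree is represented implicitly as a schedule, that is, a tuple of thread identifiers, and navigation replays the schedule from the initial state. All function names are hypothetical.

```python
def run(schedule):
    """Replay a schedule (a tuple of thread ids) from the initial program state."""
    pcs, tmps, x = [0, 0], [0, 0], 0
    for t in schedule:
        if pcs[t] == 0:
            tmps[t] = x          # load the shared variable
        else:
            x = tmps[t] + 1      # store the incremented value
        pcs[t] += 1
    return pcs, x


def children(schedule):
    """Schedules that extend the given one by a single step of an unfinished thread."""
    pcs, _ = run(schedule)
    return {schedule + (t,) for t in (0, 1) if pcs[t] < 2}


def stateless_exploration():
    """Depth-first search over the execution tree using a stack of sets of nodes."""
    frontier = [{()}]
    executions, nodes_visited = [], 0
    while frontier:
        while frontier[-1]:
            node = frontier[-1].pop()
            nodes_visited += 1
            successors = children(node)   # navigation happens by replay
            if successors:
                frontier.append(successors)
            else:
                executions.append((node, run(node)[1]))
        frontier.pop()
    return executions, nodes_visited


if __name__ == "__main__":
    executions, nodes_visited = stateless_exploration()
    for schedule, final_x in sorted(executions):
        print(schedule, final_x)
    print(nodes_visited)
```

For this toy program the search visits more execution tree nodes than there are distinct concrete program states, which is the redundancy that the reduction techniques discussed next aim to remove.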

2.1.3  Partial  Order  Reduction 

Partial  order  reduction  (POR)  [60]  is  a  technique  that  targets  efficient  systematic  testing 
of  concurrent  programs.  The  goal  of  POR  is  to  reduce  the  inefficiency  of  state  space 
exploration  resulting  from  exploration  of  equivalent  program  states. 

Algorithm  3  gives  a  high-level  overview  of  stateless  exploration  based  on  partial 
order  reduction.  Similar  to  stateless  exploration,  the  partial  order  reduction  algorithm 
maintains  an  exploration  frontier,  represented  as  a  stack  of  sets  of  nodes,  and  uses 
depth-first  search  to  explore  the  execution  tree. 

Algorithm 3 PartialOrderReduction(root)

Require: root is the initial program state.
Ensure: All non-equivalent program states reachable from root are explored.
 1: procedure DepthFirstSearch(node, frontier)
 2:     remove node from Top(frontier)
 3:     if PersistentSet(node) not empty then
 4:         Push(PersistentSet(node), frontier)
 5:         for all child ∈ Top(frontier) do
 6:             navigate execution to child
 7:             DepthFirstSearch(child, frontier)
 8:         end for
 9:         Pop(frontier)
10:     end if
11: end procedure
12: frontier ← NewStack
13: Push({root}, frontier)
14: DepthFirstSearch(root, frontier)

The PersistentSet(node) function (Algorithm 3, Line 3) uses static analysis to
identify which subtrees of the execution tree need to be explored. In particular, it takes
a node of the execution tree as input and outputs a subset of the Children(node) set that needs
to be explored in order to explore all non-equivalent sequences of program states; two
sequences of program states are considered equivalent if using them to evaluate a
program property yields the same result. Further details of the computation
of the PersistentSet function are unnecessary for understanding the content of this
thesis and are omitted for brevity. The interested reader is referred to Godefroid's seminal
treatment [60].
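
To make the role of persistent sets concrete, the sketch below uses a deliberately simplified stand-in for the PersistentSet computation; it is not Godefroid's algorithm. It assumes straight-line threads whose operations are known statically as (kind, variable) pairs, with kind "r" or "w", and it returns a single thread whenever that thread's next operation cannot conflict with any remaining operation of the other threads. All names are hypothetical, and recursion replaces the explicit frontier stack for brevity.

```python
def dependent(a, b):
    """Two accesses conflict if they touch the same variable and one is a write."""
    return a[1] == b[1] and "w" in (a[0], b[0])


def enabled_threads(threads, pcs):
    return [t for t in range(len(threads)) if pcs[t] < len(threads[t])]


def persistent_set(threads, pcs):
    """Simplified persistent set derived from statically known future operations."""
    enabled = enabled_threads(threads, pcs)
    for t in enabled:
        next_op = threads[t][pcs[t]]
        future_ops = [op for u in enabled if u != t for op in threads[u][pcs[u]:]]
        if not any(dependent(next_op, op) for op in future_ops):
            return [t]           # running t first cannot change any outcome
    return enabled               # fall back to exploring every enabled thread


def count_executions(threads, pcs=None, use_por=True):
    """Count the complete executions explored with or without the reduction."""
    if pcs is None:
        pcs = (0,) * len(threads)
    choose = persistent_set if use_por else enabled_threads
    choices = choose(threads, pcs)
    if not choices:
        return 1
    return sum(
        count_executions(threads, pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:], use_por)
        for t in choices
    )


if __name__ == "__main__":
    independent = [[("w", "a"), ("w", "a")], [("w", "b"), ("w", "b")]]
    conflicting = [[("r", "x"), ("w", "x")], [("r", "x"), ("w", "x")]]
    print(count_executions(independent, use_por=False), count_executions(independent))
    print(count_executions(conflicting, use_por=False), count_executions(conflicting))
```

When the threads touch disjoint variables, all interleavings are equivalent and a single execution suffices; when every operation conflicts, this conservative analysis yields no reduction.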

The  main  drawback  of  partial  order  reduction  is  that  static  analysis  of  complex 
programs  is  often  costly  or  infeasible  and  results  in  larger  than  necessary  persistent  sets. 
To  address  this  problem,  static  analysis  can  be  replaced  with  dynamic  analysis  [51]. 

2.1.4  Dynamic  Partial  Order  Reduction 

Dynamic  partial  order  reduction  (DPOR)  [51]  is  a  technique  that  targets  efficient  state 
space  exploration.  The  goal  of  DPOR  is  to  reduce  the  inefficiency  of  partial  order 
reduction  resulting  from  the  use  of  static  analysis.  Namely,  when  stateless  exploration 
explores  the  execution  tree,  DPOR  relies  on  dynamic  analysis  to  decide  how  to  augment 
the  existing  exploration  frontier. 

Algorithm  4  gives  a  high-level  overview  of  stateless  exploration  based  on  dynamic 
partial  order  reduction.  Similar  to  stateless  exploration,  the  dynamic  partial  order 
reduction  algorithm  maintains  an  exploration  frontier,  represented  as  a  stack  of  sets  of 
nodes,  and  uses  depth-first  search  to  explore  the  execution  tree. 

Algorithm 4 DynamicPartialOrderReduction(root)

Require: root is the initial program state.
Ensure: All non-equivalent program states reachable from root are explored.
 1: procedure DepthFirstSearch(node, frontier)
 2:     remove node from Top(frontier)
 3:     UpdateFrontier(frontier, node)
 4:     if Children(node) not empty then
 5:         child ← arbitrary element of Children(node)
 6:         Push({child}, frontier)
 7:         for all child ∈ Top(frontier) do
 8:             navigate execution to child
 9:             DepthFirstSearch(child, frontier)
10:         end for
11:         Pop(frontier)
12:     end if
13: end procedure
14: frontier ← NewStack
15: Push({root}, frontier)
16: DepthFirstSearch(root, frontier)

The UpdateFrontier(frontier, node) function (Algorithm 4, Line 3) uses dynamic
analysis to identify which subtrees of the execution tree need to be explored. In
particular, the function takes the current exploration frontier and the current node as input and
computes the happens-before [88] and dependence [60] relations between the transitions
on the path leading to the current node. This information is then used to infer which
nodes need to be added to the exploration frontier in order to explore all non-equivalent
sequences of program states. Unlike the PersistentSet function, the UpdateFrontier
function can add nodes to an arbitrary set of the exploration frontier stack, an aspect of
DPOR that prevents straightforward parallelization of the execution tree exploration
(cf. Chapter 5). Further details of the computation of the UpdateFrontier function are
unnecessary for understanding the content of this thesis and are omitted for brevity. The
interested reader is referred to the original DPOR paper [51].
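
The sketch below does not implement DPOR or the UpdateFrontier function. It only illustrates the dynamic information that paragraph refers to: given a recorded execution, it derives a happens-before relation as the transitive closure of a dependence relation. The trace format, a list of (thread, kind, variable) triples with kind "r" or "w", and all function names are assumptions of this example.

```python
from itertools import combinations


def label(trace):
    """Attach a stable identity (thread, k) to the k-th event of each thread."""
    counts, labeled = {}, []
    for thread, kind, var in trace:
        k = counts.get(thread, 0)
        counts[thread] = k + 1
        labeled.append(((thread, k), kind, var))
    return labeled


def dependent(a, b):
    """Events of the same thread, or conflicting accesses to the same variable."""
    same_thread = a[0][0] == b[0][0]
    conflict = a[2] == b[2] and "w" in (a[1], b[1])
    return same_thread or conflict


def happens_before(trace):
    """Happens-before relation of a trace as a set of ordered event identity pairs."""
    events = label(trace)
    hb = {
        (events[i][0], events[j][0])
        for i, j in combinations(range(len(events)), 2)
        if dependent(events[i], events[j])
    }
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for a, b in list(hb):
            for c, d in list(hb):
                if b == c and (a, d) not in hb:
                    hb.add((a, d))
                    changed = True
    return hb


if __name__ == "__main__":
    first = [(0, "w", "x"), (1, "w", "y"), (1, "w", "x")]
    swapped_independent = [(1, "w", "y"), (0, "w", "x"), (1, "w", "x")]
    swapped_dependent = [(1, "w", "y"), (1, "w", "x"), (0, "w", "x")]
    print(happens_before(first) == happens_before(swapped_independent))
    print(happens_before(first) == happens_before(swapped_dependent))
```

Two executions with the same happens-before relation differ only in the order of independent events and need not both be explored, whereas reversing a pair of dependent events produces a different relation and therefore a potentially different outcome.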

2.2  Practice 

Research  in  software  verification  has  produced  a  number  of  tools  for  systematic  testing 
of concurrent programs including VeriSoft [61, 62], CMC [105], FiSC [164, 165], eXplode
[163], MaceMC [82], ISP [118, 148], CHESS [106, 107], MoDist [160], dBug [135,
136,  137],  and  ETA  [138]. 

All of these tools are built around the same idea: for a given concurrent program
and  its  initial  program  state,  systematically  and  automatically  enumerate  different 
scheduling  scenarios  searching  for  errors.  What  differentiates  these  tools  is  1)  the 
operating  systems  and  programming  languages  they  target,  2)  the  techniques  they  use 
for  exploring  and  reducing  the  state  space  of  possible  scenarios,  3)  the  type  of  properties 
they  check  for,  and  4)  the  effort  needed  to  deploy  the  tool. 
Table 2.1: Overview of Systematic Testing Tools


Table  2.1  presents  an  overview  of  the  key  attributes  of  these  tools.  The  tool  column 
lists the tools in chronological order. The operating system column identifies which
operating systems, if any, the tool is restricted to. The asterisk character is used when
the tool works independently of the operating system. The programming language
column identifies which programming languages, if any, the tool is restricted to. The
asterisk  character  is  used  when  the  tool  works  independently  of  the  programming 
language.  The  state  space  exploration  column  identifies  what  type  of  state  space 
exploration  the  tool  uses.  The  state  space  reduction  column  identifies  what  type  of 
state  space  reduction  the  tool  uses.  The  properties  checked  column  identifies  what 
properties  the  tool  checks  for.  In  particular,  this  column  specifies  if  the  tool  checks 
for  1)  user-provided  propositions  in  addition  to  generic  propositions  and  2)  temporal 
propositions  [99]  in  addition  to  other  propositions.  Finally,  the  deployment  effort 
column identifies the effort needed to deploy the tool.

The remainder of this section summarizes all of these tools with the exception
of dBug and ETA. The author of this thesis built dBug and ETA in support
of the research carried out by this thesis; both tools are described in detail in Chapter 3.


2.2.1  VeriSoft 

The first practical tool for systematic testing of concurrent programs, VeriSoft [61],
was developed by Patrice Godefroid in 1997. VeriSoft can explore different serializations
of concurrent library function calls of multithreaded C and C++ programs and requires
manual instrumentation of the program. Since its creation, VeriSoft has been
successfully applied to prove safety properties and to find bugs in a number of programs
ranging from critical components of a telephone switch [63] to complete releases of
call-processing software [30].


2.2.2  CMC 

In  2002,  a  research  group  at  Stanford  developed  the  C  Model  Checker  (CMC)  [105], 
which  can  explore  different  serializations  of  event  handlers  in  C  and  C++  programs. 
CMC  was  the  first  systematic  testing  tool  to  introduce  fault  injection  as  a  mechanism  to 
test  the  error  handling  of  corner  case  scenarios.  The  main  conceptual  difference  between 
VeriSoft  and  CMC  is  the  state  space  exploration  technique.  VeriSoft  only  stores  the 
initial  state  and  possibly  re-explores  parts  of  the  state  space,  while  CMC  stores  every 
intermediate  state,  which  allows  for  faster  state  space  traversal.  Anecdotally,  VeriSoft 
typically  runs  out  of  time,  while  CMC  typically  runs  out  of  memory.  The  evaluation 
of  CMC  [105]  checked  three  implementations  of  the  AODV  networking  protocol  [117], 
finding  a  total  of  40  bugs. 
2.2.3 FiSC


In  2004,  the  same  group  at  Stanford  used  CMC  as  the  foundation  for  building  the  FiSC 
tool  [165]  for  systematic  testing  of  Linux  file  systems.  The  key  idea  behind  FiSC  is  to 
enumerate  different  ways  in  which  concurrent  file  system  operations  can  modify  data 
and  meta-data  and  simulate  system  crashes  at  arbitrary  time  points,  searching  for  file 
system inconsistencies. Similar to CMC, FiSC uses a stateful approach, encoding each
concrete state as a collection of hashed chunks of the concrete state in order to decrease
the memory footprint of each state. The evaluation of FiSC [165] checked three file systems
(ext3, JFS, and ReiserFS) and found serious bugs in all of them, 32 in total.

2.2.4  eXplode 

In  2006,  the  same  group  at  Stanford  created  the  eXplode  tool  [163]  for  systematic 
testing  of  arbitrary  storage  systems.  Unlike  FiSC,  which  inherited  the  stateful  nature 
of  its  exploration  from  CMC,  eXplode  embraces  stateless  exploration,  which  "lead  to 
orders  of  magnitude  reduction  in  complexity  and  effort  in  using  eXplode  as  opposed  to 
FiSC".  Notably,  eXplode  is  able  to  check  storage  systems  without  requiring  their  source 
code.  The  extensive  evaluation  of  eXplode  [163]  checked  three  version  control  systems, 
Berkeley DB [108], an NFS implementation [132], ten file systems, a RAID system [116],
and  the  VMware  GSX  virtual  machine,  finding  bugs  in  all  of  them,  36  in  total. 

2.2.5  MaceMC 

In  2007,  a  group  at  UCSD  created  the  programming  language  Mace  [81],  geared  towards 
designing  distributed  programs,  and  the  systematic  testing  tool  MaceMC  [82],  for 
verifying  Mace  designs.  The  Mace  programming  language  groups  events  into  Mace 
transitions  and  MaceMC  then  uses  this  abstraction  to  systematically  explore  different 
serializations  of  concurrent  transitions.  A  unique  feature  of  MaceMC  enabled  by  the 
co-design  of  a  programming  language  and  a  testing  tool  is  that  it  can  check  for  both 
safety  and  bounded  liveness  properties.  A  number  of  distributed  programs  including 
peer-to-peer  systems  Chord  [145]  and  Pastry  [126]  were  re-implemented  in  Mace.  The 
evaluation  of  MaceMC  [82]  checked  Mace  implementations  of  four  different  programs 
and  found  a  total  of  51  bugs. 

2.2.6  ISP 

In 2007, the University of Utah and Argonne National Laboratory began a collaboration
that eventually resulted in the ISP tool [118, 147] for systematic
testing of MPI programs [103]. The tool replaces key MPI communication primitives
with wrappers that are used to control the order in which MPI primitives execute.
Systematic testing is deployed via link-time interposition. The evaluation of
ISP [118, 147] checked four MPI programs and found bugs in all of them.
2.2.7 CHESS


In  2008,  researchers  from  Microsoft  Research  Redmond  designed  the  CHESS  tool  [107], 
which  explores  different  serializations  of  memory  accesses  in  a  multithreaded  program. 
CHESS uses run-time interposition and thus does not require any source code modifications.
This feature greatly simplifies the integration of CHESS into the testing process.
The  evaluation  of  CHESS  [107]  checked  eight  different  programs  ranging  from  process 
management  libraries  to  a  distributed  execution  engine  to  a  research  operating  system, 
finding  bugs  in  all  of  them,  27  in  total. 

2.2.8  MoDist 

In 2009, a combined effort of researchers from research labs and universities in both the
United States and China led to the creation of the systematic testing tool MoDist [160].
The MoDist project combined ideas from previous work to deliver a tool capable of
systematic exploration of different serializations of concurrent system calls in
multithreaded and distributed Windows programs. Similar to CHESS, MoDist does not
require source code modifications. However, instead of using run-time interposition,
MoDist relies on binary instrumentation of the Windows API provided by Box [68]. The MoDist
evaluation [160] checked three distributed programs (the replication extension of
Berkeley DB [108], a production implementation of the Paxos protocol [89], and a prototype
of a primary-backup replication protocol) and found a total of 35 bugs.


This list of success stories demonstrates that there is momentum in advancing
systematic testing of concurrent programs and provides evidence that systematic testing
tools are effective at finding errors. The success of the above tools stems from
sophisticated techniques for mitigating the combinatorial explosion of the number of possible
program states of a concurrent program.

Interestingly, none of the previous work attempts to measure how severe the
state space explosion problem actually is. In cases where systematic testing fails to
exhaust the state space, the absence of a state space size estimate poses a challenge to
quantifying the benefits of the test results. To address this problem, Chapter 4 describes
and evaluates techniques for estimating the size of the stateless exploration space. Our
evaluation on a wide range of concurrent programs provides evidence of the extent of
state space explosion.

The  rest  of  the  thesis  then  presents  research  that  aims  to  push  the  practical  limits  of 
systematic  testing.  Chapter  5  presents  a  novel  algorithm  [139]  for  state  space  exploration 
that  scales  extremely  well,  enabling  stateless  exploration  for  large  computer  clusters. 
Chapters  6  and  7  show  how  restricted  runtime  scheduling  [39]  and  abstraction  [38] 
help  to  alleviate  the  state  space  explosion  problem. 
Chapter 3

Systematic  Testing  Infrastructure 


This chapter describes two tools built in support of the research carried out by this thesis.
The ETA tool (§3.1) targets systematic testing of scheduling nondeterminism in
multithreaded components of the Omega cluster management system [129], while the dBug
tool (§3.2) targets systematic testing of scheduling nondeterminism in multithreaded
and distributed programs of POSIX-compliant operating systems [119].

To  enable  systematic  testing  of  scheduling  nondeterminism,  both  dBug  and  ETA 
need  to  address  common  challenges.  Consequently,  the  presentation  of  ETA  and  dBug 
follows  a  common  pattern.  First,  the  presentation  introduces  a  formal  model  used  for 
representing  executions  of  target  programs.  The  reason  for  defining  these  models  is  to 
provide  a  formal  framework  for  describing  the  happens-before  [88]  and  dependence  [60] 
relations  between  program  state  transitions.  These  relations  are  computed  by  both 
tools  in  order  to  enable  efficient  state  space  exploration  based  on  dynamic  partial  order 
reduction [51]. Second, the presentation discusses the mechanisms the tools use to
control input and scheduling nondeterminism and implement state space exploration.
Finally, the presentation highlights notable aspects of the respective implementations.


3.1  ETA 

This section describes ETA [138], a tool for systematic testing of multithreaded
components of the Omega cluster management system [129]. In particular, §3.1.1 defines
a model for representing multithreaded components of Omega, which embrace the
actors programming paradigm [2], and §3.1.2 describes ETA's implementation and its
integration with pre-existing testing infrastructure.

3.1.1  Program  Model 

Conceptually,  an  actor  program  consists  of  a  set  of  actors,  each  of  which  has  its  own 
private  state,  a  public  queue  for  receiving  messages,  and  a  set  of  handlers  for  processing 
of  queued  messages.  The  only  way  two  actors  can  communicate  is  by  sending  messages 
to one another. Further, each individual actor is sequential, repeatedly invoking
handlers  to  process  queued  messages.  Processing  of  a  message  is  assumed  never 
to  block  indefinitely  and  can  both  modify  the  private  state  of  the  actor  processing 
the  message  and  push  new  messages  on  queues  of  other  actors.  An  actor  program 
realizes  parallelism  by  concurrently  executing  handlers  of  different  actors.  The  following 
definition  formalizes  the  notion  of  an  actor  program. 

Definition 3.1.1. An $n$-actor program $M$ is a tuple of actors $(A_1, \ldots, A_n)$, where for
$i \in \{1, \ldots, n\}$ an actor $A_i$ is in turn a tuple $(Q_i, L_i, \Delta_i)$ consisting of a set $Q_i$ of queue states,
a set $L_i$ of local states, and a transition function $\Delta_i : \mathit{Messages} \times L_i \to (Q_1 \times \ldots \times Q_n) \times L_i$.
A queue state $q_i \in Q_i$ is a finite sequence $(m_1, \ldots, m_k) \in \mathit{Messages}^k$ of messages.

Note  that  the  above  definition  intentionally  leaves  several  terms  undefined.  The  set 
Messages  is  assumed  to  be  an  abstract  representation  of  the  set  of  messages  actors  use 
to  communicate.  In  particular,  the  inputs  and  outputs  of  the  actor  program  are  encoded 
as messages. The sets $L_1, \ldots, L_n$ are assumed to be an abstract representation of the
local state of each actor and the transition functions $\Delta_1, \ldots, \Delta_n$ are assumed to be an
abstract representation of the handlers invoked to process queued messages.

Next,  a  state  of  an  actor  program  is  defined  as  a  vector  of  pairs  of  local  and  queue 
states  belonging  to  the  actors  of  the  actor  program. 

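Definition 3.1.2. Let $M$ be an $n$-actor program. A state of $M$ is defined as a tuple
$((q_1, l_1), \ldots, (q_n, l_n)) \in (Q_1 \times L_1) \times \ldots \times (Q_n \times L_n)$. Further, a state of $M$ can be
inspected using the following projections:

• $\mathit{actor} : ((Q_1 \times L_1) \times \ldots \times (Q_n \times L_n)) \times \{1, \ldots, n\} \to \bigcup_{i=1}^{n} (Q_i \times L_i)$, a function
such that $\mathit{actor}(((q_1, l_1), \ldots, (q_n, l_n)), i) = (q_i, l_i)$

• $\mathit{queue} : \bigcup_{i=1}^{n} (Q_i \times L_i) \to \bigcup_{i=1}^{n} Q_i$, a function such that $\mathit{queue}(q_i, l_i) = q_i$

• $\mathit{local} : \bigcup_{i=1}^{n} (Q_i \times L_i) \to \bigcup_{i=1}^{n} L_i$, a function such that $\mathit{local}(q_i, l_i) = l_i$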

Next,  the  transition  function  of  an  actor  program  is  defined  as  a  combination  of  the 
transition  functions  of  the  actor  program  actors.  In  particular,  the  formal  model  of  an 
actor  program  transitions  between  states  by  choosing  an  actor  with  a  non-empty  queue, 
removing  a  message  from  the  queue  of  that  actor,  and  invoking  the  corresponding 
message  handler.  By  the  nature  of  the  transition  function,  the  invoked  message  handler 
can  append  messages  to  a  message  queue  of  any  actor,  but  it  can  only  change  the  local 
state  of  the  actor  invoking  the  handler. 

Definition 3.1.3. Let $M$ be an $n$-actor program. The transition function of $M$ is a
function $\Delta_M : (Q_1 \times L_1) \times \ldots \times (Q_n \times L_n) \times \{1, \ldots, n\} \to \{\bot\} \cup ((Q_1 \times L_1) \times \ldots \times
(Q_n \times L_n))$ such that $\Delta_M((q_1, l_1), \ldots, (q_n, l_n), i)$ equals $\bot$ if $q_i$ is empty. Otherwise,
$\Delta_M((q_1, l_1), \ldots, (q_n, l_n), i)$ equals $((q_1 \circ q'_1, l_1), \ldots, (\mathit{tail}(q_i) \circ q'_i, l'_i), \ldots, (q_n \circ q'_n, l_n))$ where
$q'_1, \ldots, q'_n$ and $l'_i$ are such that $\Delta_i(\mathit{head}(q_i), l_i) = ((q'_1, \ldots, q'_n), l'_i)$.
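
The transition function of Definition 3.1.3 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not ETA's implementation: the names Actor, Queue, Handler, step, ping, and sink are hypothetical, messages are plain integers, queues are bounded arrays, actors are indexed from 0 rather than 1, and a return value of 0 stands in for $\bot$.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_ACTORS 2   /* n in Definition 3.1.1 (kept small for the sketch) */
#define QUEUE_CAP 16 /* assumed bound; queues in the formal model are unbounded */

typedef struct { int items[QUEUE_CAP]; size_t len; } Queue;

/* Delta_i: consumes a message and the local state, fills the output queues
 * q'_1..q'_n, and updates the local state in place. */
typedef void (*Handler)(int msg, int *local, Queue out[N_ACTORS]);

typedef struct { Queue queue; int local; Handler handler; } Actor;

static void push(Queue *q, int m) {
    assert(q->len < QUEUE_CAP);
    q->items[q->len++] = m;
}

/* Hypothetical handler: accumulates the message and sends msg + 1 to actor 1. */
static void ping(int msg, int *local, Queue out[N_ACTORS]) {
    *local += msg;
    push(&out[1], msg + 1);
}

/* Hypothetical handler: accumulates the message and sends nothing. */
static void sink(int msg, int *local, Queue out[N_ACTORS]) {
    (void)out;
    *local += msg;
}

/* Delta_M: returns 0 (standing in for bottom) if actor i has an empty queue;
 * otherwise performs one transition of actor i and returns 1. */
static int step(Actor actors[N_ACTORS], size_t i) {
    Actor *a = &actors[i];
    if (a->queue.len == 0) return 0;

    int msg = a->queue.items[0];                 /* head(q_i) */
    for (size_t k = 1; k < a->queue.len; k++)    /* tail(q_i) */
        a->queue.items[k - 1] = a->queue.items[k];
    a->queue.len--;

    Queue out[N_ACTORS];
    memset(out, 0, sizeof out);
    a->handler(msg, &a->local, out);             /* Delta_i(head(q_i), l_i) */

    for (size_t j = 0; j < N_ACTORS; j++)        /* q_j o q'_j for every actor j */
        for (size_t k = 0; k < out[j].len; k++)
            push(&actors[j].queue, out[j].items[k]);
    return 1;
}
```

For instance, if actor 0 uses ping and holds the single message 5 while actor 1 uses sink and has an empty queue, step(actors, 1) returns 0, whereas step(actors, 0) consumes 5, sets the local state of actor 0 to 5, and appends 6 to the queue of actor 1. Only the local state of the actor that handles the message changes, as the definition requires.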

Next,  an  execution  of  an  actor  program  is  defined  as  a  sequence  of  states.  Further, 
this  sequence  is  identified  with  1)  an  index  sequence  that  encodes  the  order  in  which 
actors  process  messages  along  the  execution  and  2)  the  sequence  of  messages  processed 
along  the  execution. 
Definition 3.1.4. Let $M$ be an $n$-actor program, $\Delta_M$ its transition function, and $s$
its state. An execution of $M$ from $s$ is a finite sequence $\alpha = (s_0, \ldots, s_k)$ of states
of $M$ such that $s = s_0$ and there exists an index sequence $I(\alpha) = (i_0, \ldots, i_{k-1}) \in
\{1, \ldots, n\}^k$ such that $s_{j+1} = \Delta_M(s_j, i_j)$ for all $j \in \{0, \ldots, k-1\}$. Further, the message
sequence $M(\alpha) = (m_0, \ldots, m_{k-1}) \in \mathit{Messages}^k$ is a sequence of messages such that
$m_j = \mathit{head}(\mathit{queue}(\mathit{actor}(s_j, i_j)))$ for $j \in \{0, \ldots, k-1\}$ and $\mathcal{M}(\alpha)$ is used to denote the set
$\{m_0, \ldots, m_{k-1}\}$.

Next, the happens-before [88] and dependence relations [60] are defined to track
causality and interactions between messages. These relations are necessary in order to
adapt the dynamic partial order reduction algorithm (DPOR) [51] to actor programs.
Informally, the happens-before relation is the transitive closure of the causal order in
which 1) an actor processes messages and 2) a message handler creates new messages.

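Definition 3.1.5. Let $\alpha = (s_0, \ldots, s_k)$ be an execution, $I(\alpha) = (i_0, \ldots, i_{k-1})$ its index
sequence, and $M(\alpha) = (m_0, \ldots, m_{k-1})$ its message sequence. The happens-before relation
$\rightarrow_\alpha \subseteq \mathcal{M}(\alpha) \times \mathcal{M}(\alpha)$ is the smallest relation that satisfies the following conditions:

• for all $j, l \in \{0, \ldots, k-1\}$: if $j < l$ and $i_l = i_j$, then $m_j \rightarrow_\alpha m_l$

• for all $j \in \{0, \ldots, k-1\}$: if $\Delta_{i_j}(m_j, \mathit{local}(\mathit{actor}(s_j, i_j))) = ((q'_1, \ldots, q'_n), l')$, then
$m_j \rightarrow_\alpha m'$ for all $m' \in \bigcup_{i=1}^{n} q'_i$

• for all $j_1, j_2, j_3 \in \{0, \ldots, k-1\}$: if $m_{j_1} \rightarrow_\alpha m_{j_2}$ and $m_{j_2} \rightarrow_\alpha m_{j_3}$, then $m_{j_1} \rightarrow_\alpha m_{j_3}$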
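
The three conditions of Definition 3.1.5 can be computed from a recorded execution roughly as follows. This is a sketch under assumptions rather than ETA's code: the execution is assumed to be recorded as two hypothetical arrays, actor[j] (the index $i_j$ of the actor that processed $m_j$) and creator[j] (the position of the message whose handler created $m_j$, or -1 if $m_j$ was part of the initial state). Messages that were created but never processed are not represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_K 32 /* assumed upper bound on the length of an execution */

/* Fills hb so that hb[j][l] is true iff m_j happens before m_l. */
static void happens_before(size_t k, const int actor[], const int creator[],
                           bool hb[MAX_K][MAX_K]) {
    assert(k <= MAX_K);
    for (size_t j = 0; j < k; j++)
        for (size_t l = 0; l < k; l++)
            hb[j][l] = false;

    for (size_t l = 0; l < k; l++) {
        /* Condition 1: messages processed by the same actor are ordered. */
        for (size_t j = 0; j < l; j++)
            if (actor[j] == actor[l]) hb[j][l] = true;
        /* Condition 2: a message precedes every message its handler creates. */
        if (creator[l] >= 0) hb[creator[l]][l] = true;
    }

    /* Condition 3: transitive closure (Warshall's algorithm). */
    for (size_t m = 0; m < k; m++)
        for (size_t j = 0; j < k; j++)
            for (size_t l = 0; l < k; l++)
                if (hb[j][m] && hb[m][l]) hb[j][l] = true;
}
```

As a usage example, take actor = {0, 1, 1, 0, 0} and creator = {-1, -1, 0, -1, 2}. Message 1 happens before message 4 only through transitivity (1 before 2 on the same actor, 2 creates 4), while messages 0 and 1 remain unordered and are therefore concurrent.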

Definition 3.1.6. Let $\alpha = (s_0, \ldots, s_k)$ be an execution, $I(\alpha) = (i_0, \ldots, i_{k-1})$ its index
sequence, and $M(\alpha) = (m_0, \ldots, m_{k-1})$ its message sequence. A dependence
relation is a relation $D_\alpha \subseteq \mathcal{M}(\alpha) \times \mathcal{M}(\alpha)$ such that for all $j_1, j_2 \in \{0, \ldots, k-1\}$: if
there exists $i \in \{1, \ldots, n\}$ such that $\Delta_{i_{j_1}}(m_{j_1}, \mathit{local}(\mathit{actor}(s_{j_1}, i_{j_1}))) = ((q_1, \ldots, q_n), l)$,
$\Delta_{i_{j_2}}(m_{j_2}, \mathit{local}(\mathit{actor}(s_{j_2}, i_{j_2}))) = ((q'_1, \ldots, q'_n), l')$, and $q_i$ and $q'_i$ are non-empty sequences,
then $D_\alpha(m_{j_1}, m_{j_2})$.

The motivation behind the definition of dependence is to identify pairs of messages
with possibly non-commutative effects. In particular, if $D_\alpha(m_1, m_2)$, then in the course
of the execution $\alpha$, handling of messages $m_1$ and $m_2$ adds messages to overlapping sets
of message queues and the order in which messages $m_1$ and $m_2$ are handled may affect
the outcome of the actor program.
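
A check in the spirit of Definition 3.1.6 can be sketched in C as a bitmask intersection. The representation is an assumption made for illustration: targets[j] is a hypothetical per-message record in which bit i is set exactly when handling $m_j$ appended at least one message to the queue of actor i, which limits the sketch to at most 32 actors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true iff handling m_j1 and handling m_j2 both appended at least one
 * message to the queue of some common actor, i.e. D(m_j1, m_j2) holds. */
static bool dependent(const uint32_t targets[], size_t j1, size_t j2) {
    return (targets[j1] & targets[j2]) != 0;
}
```

With targets = {0x3, 0x2, 0x4, 0x0}, messages 0 and 1 are dependent because both handlers appended to the queue of actor 1, whereas messages 0 and 2 are not. A message whose handler sends nothing (message 3) is not dependent on any message under this definition. The mask can be recorded at runtime as a side effect of delivering handler output, which is what makes the relation cheap to compute.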

Note that the above definition of dependence produces a valid dependence relation [60]
but not necessarily a minimal one. Thus, the dependence relation computed
based  on  Definition  3.1.6  may  be  an  over-approximation  of  the  true  dependence.  In 
contrast  to  that,  the  happens-before  relation  computed  based  on  Definition  3.1.5  may  be 
an  under-approximation  of  the  true  causality. 

The  nature  of  the  DPOR  algorithm  justifies  such  approximations.  The  smaller  the 
dependence  relation  is  and  the  larger  the  happens-before  relation  is,  the  more  states  the 
DPOR  algorithm  can  reduce.  As  long  as  the  dependence  relation  does  not  fail  to  relate 
any  messages  that  are  in  fact  dependent,  and  the  happens-before  relation  does  not  relate 
any  messages  that  are  in  fact  concurrent,  the  DPOR  algorithm  will  work  correctly. 
In practice, one aims to strike a balance between accurate identification of dependence
and causality and the overhead of doing so. The above definitions of the
happens-before and dependence relations are crafted so that these relations can be easily
computed at runtime, avoiding the need for static analysis or program annotations.

3.1.2  Implementation 

The  runtime  of  Omega  actor  programs  controls  the  order  in  which  messages  are 
handled  through  an  actor  manager.  In  production,  the  actor  manager  is  multithreaded, 
concurrently  processing  messages  from  different  message  queues.  In  contrast  to  that, 
when testing Omega's actor programs, a sequential actor manager that enables fine-grained
control of message ordering is used instead. This approach to testing relies
on  the  assumption  that  the  actor  managers  used  in  production  and  in  testing  are 
functionally  equivalent. 

To  enable  systematic  testing  of  Omega  actor  programs,  ETA  implements  a  new 
exploratory  actor  manager.  Similar  to  the  default  actor  manager  for  testing,  the  exploratory 
actor  manager  serializes  the  handling  of  concurrent  messages.  Unlike  the  default  actor 
manager  for  testing,  which  is  deterministic,  different  executions  of  the  same  actor 
program  test  under  the  exploratory  actor  manager  can  explore  different  message  orders 
and  program  states. 


Exploration 

To  explore  different  program  states,  ETA  repeatedly  executes  the  same  test  and  uses 
stateless  exploration  to  navigate  the  state  space.  In  other  words,  ETA  runs  the  actual 
actor  program  but,  with  the  exception  of  the  current  program  state,  it  does  not  store  the 
program  states  explicitly.  Instead,  program  states  are  represented  implicitly  using  index 
sequences  of  executions  that  lead  to  them. 

The  program  states  revealed  by  test  executions  are  recorded  and  stored  in  an  execution 
tree,  with  nodes  implicitly  representing  program  states  and  edges  representing  index 
sequence  elements.  Initially,  the  exploration  holds  no  knowledge  about  the  structure  of 
the  execution  tree  and  as  new  message  sequences  are  explored,  the  execution  tree  is 
gradually  unfolded. 

An  execution  of  an  actor  program  test  in  ETA  uses  the  execution  tree  to  generate 
an  index  sequence  which  identifies  a  node  of  the  execution  tree  exploration  frontier 
to  explore  next.  The  exploratory  actor  manager  then  uses  this  index  sequence  to 
steer  the  execution  towards  this  node.  Once  this  node  is  reached,  the  exploratory 
actor  manager  switches  to  round-robin  scheduling,  recording  its  decisions.  Once  the 
test  execution  completes  (or  times  out),  ETA  processes  the  explored  message  sequence, 
computing  the  happens-before  relation  (cf.  Definition  3.1.5)  and  the  dependence  relation 
(cf.  Definition  3.1.6),  and  uses  the  DPOR  algorithm  (cf.  Section  2.1.4)  to  update  the 
exploration  frontier. 
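
The replay-then-continue loop described above can be sketched in C as follows. The sketch makes several simplifying assumptions that are not part of ETA: the test body is a toy program in which actor i starts with initial[i] pending messages and handlers create no new messages, the frontier is a plain stack of index-sequence prefixes, and no DPOR reduction is applied, so every message order is enumerated. All names (IndexSeq, explore) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_ACTORS 2
#define MAX_LEN 8       /* assumed bound on the length of an execution */
#define MAX_FRONTIER 64 /* assumed bound on the size of the frontier */

typedef struct { int idx[MAX_LEN]; size_t len; } IndexSeq;

/* Exhaustively explores the execution tree and returns the number of complete
 * executions. Program states are never stored: each iteration restarts the
 * test from its initial state and replays an index-sequence prefix. */
static size_t explore(const int initial[N_ACTORS]) {
    static IndexSeq frontier[MAX_FRONTIER];
    size_t top = 0, executions = 0;
    frontier[top++] = (IndexSeq){ .len = 0 };   /* root of the execution tree */

    while (top > 0) {
        IndexSeq prefix = frontier[--top];
        int pending[N_ACTORS];
        memcpy(pending, initial, sizeof pending); /* re-run from the initial state */
        IndexSeq trace = { .len = 0 };
        int last = N_ACTORS - 1;

        for (;;) {
            int choice = -1;
            if (trace.len < prefix.len) {
                choice = prefix.idx[trace.len];   /* steer towards the frontier node */
                assert(pending[choice] > 0);      /* replay must be deterministic */
            } else {
                /* Round-robin among enabled actors; every enabled actor that is
                 * not chosen becomes a new frontier node. */
                for (int d = 1; d <= N_ACTORS; d++) {
                    int i = (last + d) % N_ACTORS;
                    if (pending[i] == 0) continue;
                    if (choice < 0) { choice = i; continue; }
                    assert(top < MAX_FRONTIER && trace.len < MAX_LEN);
                    IndexSeq alt = trace;
                    alt.idx[alt.len++] = i;
                    frontier[top++] = alt;
                }
            }
            if (choice < 0) break;                /* no actor enabled: test is done */
            assert(trace.len < MAX_LEN);
            trace.idx[trace.len++] = choice;      /* record the scheduling decision */
            pending[choice]--;
            last = choice;
        }
        executions++;
    }
    return executions;
}
```

For two actors with two pending messages each, explore visits all six distinct message orders. ETA differs from this sketch in the step that follows each execution: instead of adding every unchosen sibling to the frontier, it uses the happens-before and dependence relations to let DPOR decide which alternatives need to be explored at all.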




Nondeterminism 


In order for ETA to navigate the execution tree, the message queue identifiers of the
Omega actors runtime must be identical across different test executions (a property of the
actors library), and identical index sequences must produce identical message queue
contents across different test executions (a property of the actor program and its test).

To  accomplish  the  latter,  ETA  needs  to  make  sure  that  message  ordering  is  indeed 
the  only  source  of  nondeterminism.  Otherwise,  the  exploratory  actor  manager  might 
not  be  able  to  deterministically  replay  previously  explored  message  orders,  which  is 
necessary  for  stateless  exploration.  To  meet  this  requirement,  ETA  uses  deterministic 
seeds  for  pseudo-random  number  generators  and  deterministic  mock  implementations 
of  nondeterministic  components  of  the  environment  such  as  RPC  servers. 

The  broader  notion  is  to  treat  all  nondeterminism  as  messages  but  only  explore  the 
ones  the  user  considers  most  pertinent.  This  notion  could  be  implemented  as  a  more 
general  exploratory  manager  that  controls  the  different  sources  of  nondeterminism.  The 
main  advantage  of  a  more  general  exploratory  manager  is  its  ability  to  test  interactions 
between the different sources of nondeterminism. The main disadvantage is the additional
combinatorial explosion of the number of possible interactions, impeding the
ability  of  the  more  general  exploratory  manager  to  investigate  all  interactions.  Given 
the  size  of  the  state  spaces  of  actor  program  tests  in  the  Omega  test  suite  (see  Chapter  4), 
ETA  focuses  on  message  ordering  only. 


Deployment 

The  ultimate  goal  of  systematic  testing  is  to  help  software  engineers  to  test  their 
programs  better.  Keeping  this  goal  in  mind,  ETA  is  designed  to  smoothly  integrate  with 
the  infrastructure  for  testing  Omega  actor  programs. 

Typically,  an  Omega  actor  program  test  sets  up  some  initial  state  and  then  repeatedly 
uses  the  Omega  actors  library  API  to  either  trigger  message  handling  or  to  inspect  the 
current  program  state  for  errors.  The  message  handling  triggers  can  instruct  the  actor 
manager  to  handle  one  message  or  to  continue  handling  messages  until  a  temporal 
property  [99]  becomes  true  (or  there  are  no  messages  left). 

To automate the testing process, Omega uses the Google Test framework [66]. This
framework  provides  a  number  of  test  macros  through  which  a  software  engineer 
describes  a  unit  test  and  the  framework  then  executes  these  tests  and  reports  the  results. 
To  add  support  for  systematic  testing  into  the  existing  infrastructure  for  testing  Omega, 
the author of this thesis extended the Google Test framework with new test macros,
one  for  each  original  test  macro,  that  can  be  used  as  drop-in  replacements  for  the 
original  Google  Test  macros.  By  default  the  new  test  macros  behave  identically  to  their 
Google  Test  counterparts,  executing  the  body  of  the  test  macro  once.  However,  if  the 
test  is  executed  with  a  special  command-line  flag,  systematic  testing  is  used,  repeatedly 
executing  the  body  of  the  test  until  the  execution  tree  representing  the  state  space  of  the 
test  body  is  fully  explored  or  a  timeout  is  reached. 
Runtime Estimation


For traditional tests, runtime can be estimated using a back-of-the-envelope analysis
based  on  the  program  design  and  hardware  architecture.  In  contrast,  the  runtime 
of  systematic  tests  depends  largely  on  the  interaction  between  1)  the  combinatorial 
explosion  of  the  number  of  ways  in  which  concurrent  events  of  the  test  can  execute  and 
2)  the  realized  state  space  reduction.  This  quality  of  systematic  tests  makes  estimation 
of  their  runtime  challenging.  In  fact,  prior  to  ETA,  no  systematic  testing  tool  had  the 
capability  to  estimate  the  runtime  of  its  tests. 

The  absence  of  runtime  estimates  leads  to  ad-hoc  and  possibly  inefficient  allocations 
of  scarce  testing  resources.  Recognizing  this  shortcoming  of  existing  systematic  testing 
tools,  ETA  implements  a  novel  technique  for  runtime  estimation  (cf.  Chapter  4).  This 
technique  predicts  runtimes  for  systematic  Omega  tests  and  enables  efficient  allocation 
of  testing  resources. 


Parallel  State  Space  Exploration 

In addition to the traditional sequential implementation of the DPOR algorithm, ETA
implements a novel scalable version of the DPOR algorithm (cf. Chapter 5), which
explores the state space of possible program states concurrently using a number of parallel
test executions. Although concurrent exploration of the execution tree seems
straightforward at first sight, a naive implementation may result in redundant exploration [167]
and working out the details in the context of DPOR requires some care.

ETA's scalable implementation of the DPOR algorithm enables concurrent exploration
of systematic Omega tests using Google data centers [65]. The technique achieves
strong scaling and pushes the practical limits of systematic testing by orders of magnitude.


3.2  dBug 

This  section  describes  dBug  [135, 137],  a  tool  for  systematic  testing  of  multithreaded  and 
distributed  programs  of  POSIX-compliant  operating  systems  [119].  Unlike  ETA,  dBug 
is  publicly  available  [135]  and  is  not  restricted  to  a  particular  programming  language. 
The  motivation  behind  creating  dBug  is  not  only  to  provide  a  platform  for  carrying 
out  systematic  testing  research,  but  also  to  provide  a  practical  tool  that  can  be  used  for 
systematic  testing  of  a  wide  range  of  existing  programs. 

Similar  to  the  presentation  of  ETA,  the  presentation  of  dBug  first  describes  a  formal 
framework  used  by  dBug  to  model  concurrent  programs  (§  3.2.1)  and  then  discusses 
notable  aspects  of  dBug's  design  (§  3.2.2)  and  implementation  (§  3.2.3). 
3.2.1 Program Model

This subsection defines a formal framework for modeling real-world concurrent programs.
Similar to the framework used by ETA, the purpose of this framework is to
capture  causality  and  dependence  between  program  state  transitions  so  that  dBug  can 
implement  efficient  state  space  exploration  based  on  dynamic  partial  order  reduction 
(DPOR)  [51].  Unlike  the  framework  used  by  ETA,  the  framework  used  by  dBug  is  not 
tailored  to  a  specific  programming  paradigm  and  is  capable  of  modeling  concurrent 
programs  at  different  levels  of  abstraction. 

Definition 3.2.1. An abstract state $s$ is an element of the universe $S = G_S \times \mathcal{P}(\mathit{Identifiers} \times L_S)$,
where $G_S : G_O \to \mathit{Values}$ is the global state function that assigns values to
global objects $G_O$, $\mathit{Identifiers}$ is a set of thread identifiers, $L_S : L_O \to \mathit{Values}$ is the local state
function that assigns values to local objects $L_O$, and $\mathcal{P}(X)$ denotes the power set of $X$.
Given an abstract state $s$, it can be inspected using the following projections:

• $\mathit{global} : S \to G_S$, a function such that $\mathit{global}((g, \{(id_1, l_1), \ldots, (id_n, l_n)\})) = g$

• $\mathit{local} : S \times \mathit{Identifiers} \to \{\bot\} \cup L_S$, a function such that:

$$\mathit{local}((g, \{(id_1, l_1), \ldots, (id_n, l_n)\}), id) = \begin{cases} l_i & \text{if } id = id_i \\ \bot & \text{otherwise} \end{cases}$$


Definition 3.2.2. An abstract program is a tuple $P = (S, \Delta, \phi)$, where $S$ is a set of
abstract states, $\Delta \subseteq S \times S$ is a transition relation, with elements referred to as transitions,
and $\phi : \Delta \to \mathcal{P}(G_O) \times \mathcal{P}(G_O) \times \mathcal{P}(\mathit{Identifiers} \times \mathcal{P}(L_O))$ is an abstract footprint function.
Further, given an abstract program $P$, it can be inspected using the following projections:

• $\mathit{threads} : \Delta \to \mathcal{P}(\mathit{Identifiers})$, a function such that $\mathit{threads}((s, s')) = \{id \mid$ there exist
$g_r, g_w \in \mathcal{P}(G_O)$, $id_1, \ldots, id_n \in \mathit{Identifiers}$, $l_1, \ldots, l_n \in \mathcal{P}(L_O)$, and $i \in \{1, \ldots, n\}$
such that $\phi((s, s')) = (g_r, g_w, \{(id_1, l_1), \ldots, (id_n, l_n)\})$ and $id = id_i\}$

• $\mathit{readset} : \Delta \to \mathcal{P}(G_O)$, a function such that $\mathit{readset}((s, s')) = g_r$ such that $\phi((s, s')) =
(g_r, g_w, \{(id_1, l_1), \ldots, (id_n, l_n)\})$ for some $g_w \in \mathcal{P}(G_O)$, $id_1, \ldots, id_n \in \mathit{Identifiers}$, and
$l_1, \ldots, l_n \in \mathcal{P}(L_O)$

• $\mathit{writeset} : \Delta \to \mathcal{P}(G_O)$, a function such that $\mathit{writeset}((s, s')) = g_w$ such that
$\phi((s, s')) = (g_r, g_w, \{(id_1, l_1), \ldots, (id_n, l_n)\})$ for some $g_r \in \mathcal{P}(G_O)$, $id_1, \ldots, id_n \in
\mathit{Identifiers}$, and $l_1, \ldots, l_n \in \mathcal{P}(L_O)$

Note that the above definitions intentionally leave several terms undefined. The set
$G_O$ is assumed to be an abstract representation of the objects shared among the threads,
the set $L_O$ is assumed to be an abstract representation of the local objects of each thread,
the set $\mathit{Values}$ is assumed to be a set of values the global and local objects can take on,
and the set $\mathit{Identifiers}$ is assumed to be a set of unique thread identifiers. The transition
relation $\Delta$ abstracts actual program transitions and the abstract footprint function $\phi$
abstracts how global and local objects are accessed.

The motivation behind the above definitions is to formalize a mechanism that makes
it possible to model concurrent programs at different levels of abstraction captured by
the choice of global and local objects. The following paragraphs illustrate the above
abstract definitions on concrete examples.

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    int x = 0;

    /* Child thread: increments the shared variable without synchronization. */
    void *foo(void *args) {
      x++;
      return NULL;
    }

    int main(int argc, char **argv) {
      pthread_t tid;
      pthread_create(&tid, NULL, foo, NULL);
      x++;  /* The parent increments x concurrently with the child. */
      pthread_join(tid, NULL);
      assert(x == 2);
      return 0;
    }

Figure 3.1: Concurrent Memory Access Example - Source Code

Example 1. Consider the C program depicted in Figure 3.1. This program spawns a
thread and then concurrently increments a global variable x from both threads. To
abstractly model the state space of possible program states, one can set $G_0$ to consist of
assignments to the variable x, $L_0$ to consist of control flow locations identified by source
code line numbers, and Identifiers to consist of identifiers {parent, child}. The state space
generated by this abstraction is depicted as a graph in Figure 3.2. The nodes represent
abstract program states and the edges represent abstract program transitions, labeled by
values of the threads, writeset, and readset functions respectively. This example illustrates
1) how to abstractly model program states of a concrete program, 2) how to model
scheduling nondeterminism using a transition relation, and 3) how to model the effects
of program transitions using the abstract footprint function. Note that choosing line
numbers, as opposed to instruction register values, as the abstraction for control flow
locations produces a model that does not contain a program state that witnesses the
data race feasible when executing the program on hardware that does not increment x
atomically.

Figure 3.2: Concurrent Memory Access Example - State Space

Example 2. Consider the MPI [103] program depicted in Figure 3.3. When MPI creates
two instances of this program, the instances exchange a message. To abstractly model
the state space of possible program states, one can set $G_0$ to consist of possible contents
of the buffer MPI uses to transfer a message, $L_0$ to consist of MPI function invocations
identified by source code line numbers, and Identifiers to consist of identifiers {p0, p1}.
The state space generated by this abstraction is depicted as a graph in Figure 3.4. The
nodes represent abstract program states and the edges represent abstract transitions,
labeled by values of the threads, writeset, and readset functions respectively. This example
illustrates how to model nondeterminism in the MPI specification (messages can be
exchanged synchronously or asynchronously) using a transition relation.

Examples 1 and 2 together illustrate how the formalism introduced by Definitions
3.2.1 and 3.2.2 enables modeling concurrent programs at different levels of abstraction.
Note that the level of abstraction affects the complexity and precision of program
analysis. The more details are abstracted away, the smaller the number of program
states to analyze but the lower the precision with which the program is analyzed. The
key to effective program analysis is to strike a balance between two opposing
goals: avoiding the omission of important program behaviors and avoiding the creation
of overly detailed models that are too large to be analyzed [24]. dBug approaches this
problem by offering different levels of abstraction at which it can model concurrent
programs (cf. Chapter 7).

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define SIZE 32

    int main(int argc, char **argv) {
      char message[SIZE];
      int myrank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
      if (myrank == 0) {
        snprintf(message, SIZE, "Hello!");
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      }
      if (myrank == 1) {
        MPI_Status status;
        MPI_Recv(message, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        printf("%s\n", message);
      }
      MPI_Finalize();
      return 0;
    }

Figure 3.3: MPI Communication Example - Source Code

Given  a  level  of  abstraction,  dBug  uses  sequences  of  abstract  states  to  model  concrete 
executions  of  a  concurrent  program  and  state  space  exploration  based  on  the  DPOR 
algorithm  to  explore  different  scenarios  allowed  by  scheduling  nondeterminism.  Similar 
to  the  presentation  of  actor  programs,  definitions  of  the  happens-before  and  dependence 
relations  are  necessary  to  adapt  the  DPOR  algorithm  for  abstract  programs. 

Definition 3.2.3. Let $P = (S, \Delta, \Phi)$ be an abstract program and $s \in S$ an abstract state.
An execution of $P$ from $s$ is defined as a finite sequence $\alpha = (s_0, \ldots, s_k)$ of abstract states
of $S$ such that $s = s_0$ and for all $i \in \{0, \ldots, k-1\}$: $(s_i, s_{i+1}) \in \Delta$. $\Delta_\alpha$ is used to denote
the set $\{(s_i, s_{i+1}) \mid i \in \{0, \ldots, k-1\}\}$.

To  compute  the  happens-before  relation  for  executions  of  abstract  programs,  dBug 
uses  a  clock  function  that  for  each  thread  and  global  object  maintains  a  vector  of  the  most 
recently  observed  values  of  logical  time  of  each  thread.  In  particular,  when  a  thread  or  a 
global  object  is  involved  in  an  abstract  program  transition,  the  clock  function  updates  the 
vector  of  the  logical  time  values  maintained  by  this  thread  or  global  object  as  follows. 
First, the most recently observed values of all threads and global objects involved in
the same abstract program transition are joined and then each thread involved in the
abstract program transition advances its own logical time by one.

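Figure 3.4: MPI Communication Example - State Space

Definition 3.2.4. Let $P = (S, \Delta, \Phi)$ be an abstract program, $G_0$ its set of global objects,
and $\alpha = (s_0, \ldots, s_k)$ its execution. The clock function $\mathit{clock} : S \times (\mathit{Identifiers} \cup G_0) \times
\mathit{Identifiers} \to \mathbb{N}$ is a function that meets the following conditions:

1. for all $g \in G_0$ and $id, id' \in \mathit{Identifiers}$: $\mathit{clock}(s_0, id, id') = \mathit{clock}(s_0, g, id') = 0$

2. for all $id_1, id_2 \in \mathit{Identifiers}$ and $i \in \{1, \ldots, k\}$:

$$\mathit{clock}(s_i, id_1, id_2) = \begin{cases}
\mathit{join}(id_2, s_{i-1}, s_i) + 1 & \text{if } id_1 \in \mathit{threads}(s_{i-1}, s_i) \text{ and } id_1 = id_2 \\
\mathit{join}(id_2, s_{i-1}, s_i) & \text{if } id_1 \in \mathit{threads}(s_{i-1}, s_i) \text{ and } id_1 \neq id_2 \\
\mathit{clock}(s_{i-1}, id_1, id_2) & \text{otherwise}
\end{cases}$$

3. for all $g \in G_0$, $id \in \mathit{Identifiers}$, and $i \in \{1, \ldots, k\}$:

$$\mathit{clock}(s_i, g, id) = \begin{cases}
\mathit{join}(id, s_{i-1}, s_i) & \text{if } g \in \mathit{writeset}(s_{i-1}, s_i) \\
\mathit{clock}(s_{i-1}, g, id) & \text{otherwise}
\end{cases}$$

where $\mathit{join}(id, s, s') = \max(\{\mathit{clock}(s, id', id) \mid id' \in \mathit{threads}(s, s')\} \cup
\{\mathit{clock}(s, g, id) \mid g \in (\mathit{readset}(s, s') \cup \mathit{writeset}(s, s'))\})$.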
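The update rules of Definition 3.2.4 can be sketched in C as follows. This is a minimal illustration and not dBug's implementation: the names `clock_state`, `footprint`, and `clock_step`, the bitmask encoding of the footprint, and the fixed sizes `NTHREADS` and `NOBJECTS` are all assumptions made for the example. The sketch also assumes that a written global object takes on the joined value without any increment.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 2 /* hypothetical: thread 0 = parent, thread 1 = child */
#define NOBJECTS 1 /* hypothetical: object 0 = the global variable x */

/* Values of the clock function in one abstract state. */
typedef struct {
  unsigned thread_clock[NTHREADS][NTHREADS]; /* clock(s, id, id') */
  unsigned object_clock[NOBJECTS][NTHREADS]; /* clock(s, g, id')  */
} clock_state;

/* Abstract footprint of one transition, encoded as bitmasks (assumption). */
typedef struct {
  unsigned threads;  /* bit i set: thread i takes part in the transition */
  unsigned readset;  /* bit g set: global object g is read */
  unsigned writeset; /* bit g set: global object g is written */
} footprint;

/* join(id, s, s'): the largest logical time of thread id observed by any
   thread or global object involved in the transition. */
static unsigned join(const clock_state *s, const footprint *t, size_t id) {
  unsigned latest = 0;
  for (size_t i = 0; i < NTHREADS; i++)
    if (((t->threads >> i) & 1u) && s->thread_clock[i][id] > latest)
      latest = s->thread_clock[i][id];
  for (size_t g = 0; g < NOBJECTS; g++)
    if ((((t->readset | t->writeset) >> g) & 1u) &&
        s->object_clock[g][id] > latest)
      latest = s->object_clock[g][id];
  return latest;
}

/* Computes the clocks of the successor state; prev and next must not alias. */
void clock_step(const clock_state *prev, const footprint *t,
                clock_state *next) {
  assert(prev != next);
  *next = *prev; /* the "otherwise" cases: uninvolved clocks are unchanged */
  for (size_t id1 = 0; id1 < NTHREADS; id1++) {
    if (!((t->threads >> id1) & 1u)) continue;
    for (size_t id2 = 0; id2 < NTHREADS; id2++)
      next->thread_clock[id1][id2] =
          join(prev, t, id2) + (id1 == id2 ? 1u : 0u);
  }
  for (size_t g = 0; g < NOBJECTS; g++) {
    if (!((t->writeset >> g) & 1u)) continue;
    for (size_t id = 0; id < NTHREADS; id++)
      next->object_clock[g][id] = join(prev, t, id);
  }
}
```

Calling `clock_step` once per transition of an execution, starting from an all-zero `clock_state`, yields the clock values of every abstract state along that execution.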

Definition 3.2.5. Let $P = (S, \Delta, \Phi)$ be an abstract program and $\alpha = (s_0, \ldots, s_k)$ its
execution. The happens-before relation $\rightarrow_\alpha \subseteq \Delta_\alpha \times \Delta_\alpha$ is a binary relation that meets the
following condition: for all $(s_i, s_i'), (s_j, s_j') \in \Delta_\alpha$: $(s_i, s_i') \rightarrow_\alpha (s_j, s_j')$ if and only if for all
$id_i \in \mathit{threads}(s_i, s_i')$ and $id_j \in \mathit{threads}(s_j, s_j')$:

1. for all $id \in \mathit{Identifiers}$: $\mathit{clock}(s_i, id_i, id) \leq \mathit{clock}(s_j, id_j, id)$ and

2. there exists $id \in \mathit{Identifiers}$ such that $\mathit{clock}(s_i, id_i, id) < \mathit{clock}(s_j, id_j, id)$.

Further, when a pair of transitions is not related through the happens-before relation,
the transitions are called concurrent.

Figure 3.5: Concurrent Memory Address Example - Execution

Example 3. Consider again the C program depicted in Figure 3.1 and its execution
depicted in Figure 3.5, which annotates each abstract state with appropriate values of
the clock function. The first argument of the clock function is identified by the abstract
state each annotation table is attached to and the second and the third arguments are
identified by the column and the row of each annotation table respectively. Notably, this
example illustrates how logical time is propagated through transitions that modify local
state of multiple threads or the global state. The transitions of the depicted execution $\alpha$
are numbered and the values of the clock function can be used to compute the
happens-before relation $\rightarrow_\alpha = \{(1,2), (2,3), (3,5), (3,6), (4,5), (6,7), (7,8)\}^+$, where $X^+$ denotes
the transitive closure of $X$. Note that the transitions 3 and 4 both increment the global
variable x and are not causally ordered by $\rightarrow_\alpha$. Consequently, although the abstract state
space does not contain an abstract state that witnesses the data race feasible in practice,
the happens-before relation $\rightarrow_\alpha$ along with the abstract footprint function $\Phi$ can be
used to detect the data race. Further, note that since the abstract program does not keep
track of the status of the child thread as part of the global state, the happens-before
relation fails to capture the causality implied by joining the child thread from the
parent thread. In other words, the happens-before relation of the abstract program is in
general an under-approximation of the true causality between program transitions.
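The comparison at the core of Definition 3.2.5 can be sketched as a small C routine over two vectors of logical time. The function names `vc_before` and `vc_concurrent` are hypothetical, and the sketch compares the clocks of a single pair of threads; the definition additionally requires the comparison to hold for every pair of threads participating in the two transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true iff vector clock a is component-wise less than or equal to
   vector clock b and strictly smaller in at least one component. */
bool vc_before(const unsigned *a, const unsigned *b, size_t n) {
  assert(a != NULL && b != NULL);
  bool strictly_less = false;
  for (size_t i = 0; i < n; i++) {
    if (a[i] > b[i]) return false;      /* condition 1 is violated */
    if (a[i] < b[i]) strictly_less = true; /* witness for condition 2 */
  }
  return strictly_less;
}

/* Two clocks are concurrent when neither is ordered before the other. */
bool vc_concurrent(const unsigned *a, const unsigned *b, size_t n) {
  return !vc_before(a, b, n) && !vc_before(b, a, n);
}
```

For instance, the clocks (1, 0) and (0, 1) are concurrent under this comparison, while (1, 0) is ordered before (1, 1).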

Next, a dependence relation [60] between transitions of an abstract program execution
is defined to track transitions with non-commutative effects.

Definition 3.2.6. Let $P = (S, \Delta, \Phi)$ be an abstract program and $\alpha = (s_0, \ldots, s_k)$ its
execution. The dependence relation $D_\alpha \subseteq \Delta_\alpha \times \Delta_\alpha$ is a binary relation such that for all
$(s_i, s_i'), (s_j, s_j') \in \Delta_\alpha$: $((s_i, s_i'), (s_j, s_j')) \in D_\alpha$ if any of the following conditions is true:

1. $\mathit{writeset}(s_i, s_i')$ and $\mathit{writeset}(s_j, s_j')$ are not disjoint

2. $\mathit{readset}(s_i, s_i')$ and $\mathit{writeset}(s_j, s_j')$ are not disjoint

3. $\mathit{writeset}(s_i, s_i')$ and $\mathit{readset}(s_j, s_j')$ are not disjoint

4. $\mathit{threads}(s_i, s_i')$ and $\mathit{threads}(s_j, s_j')$ are not disjoint

Further,  when  a  pair  of  transitions  is  related  through  the  dependence  relation,  the 
transitions  are  called  dependent. 
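The four conditions of Definition 3.2.6 amount to set intersection tests. The following C sketch encodes each set of a footprint as a bitmask; the type `access_footprint`, the function `dependent`, and the bitmask encoding are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Abstract footprint of a transition (hypothetical encoding): bit i of
   threads stands for thread i, bit g of readset/writeset for object g. */
typedef struct {
  uint32_t threads;
  uint32_t readset;
  uint32_t writeset;
} access_footprint;

/* Returns true iff the two transitions are dependent, that is, iff any of
   the four pairs of sets named in Definition 3.2.6 is not disjoint. */
bool dependent(const access_footprint *a, const access_footprint *b) {
  assert(a != NULL && b != NULL);
  return (a->writeset & b->writeset) != 0 || /* condition 1 */
         (a->readset & b->writeset) != 0 ||  /* condition 2 */
         (a->writeset & b->readset) != 0 ||  /* condition 3 */
         (a->threads & b->threads) != 0;     /* condition 4 */
}
```

With this encoding, the two increments of x in Figure 3.1 are dependent because both write the same object, whereas two reads of x by different threads are not.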

Unlike  the  dependence  relation  for  actor  programs  (Definition  3.1.6),  the  dependence 
relation  for  abstract  programs  (Definition  3.2.6)  can  fail  to  relate  transitions  with  non- 
commutative  effects.  This  can  happen  when  the  state  modified  by  both  transitions  has 
been  abstracted  away.  Note  that  this  is  not  a  shortcoming  of  the  above  definition.  If 
the  abstract  program  modeled  every  bit  of  the  concrete  program  state,  the  dependence 
relation  would  likely  relate  every  pair  of  program  transitions.  After  all,  every  program 
transition  modifies  the  contents  of  the  instruction  register.  Consequently,  the  DPOR 
algorithm  would  fail  to  achieve  any  reduction,  rendering  the  systematic  testing  approach 
impractical  for  any  real-world  program.  Instead,  the  abstraction  of  the  concrete  program 
state  identifies  state  pertinent  to  the  inspection  of  the  desired  program  behavior.  The 
dependence relation computed based on Definition 3.2.6 simply represents a mechanism
that passes this information on to the DPOR algorithm.

Figure 3.6: Concurrent Memory Address Example - Execution Tree


In practice, dBug executes the concrete program and the DPOR algorithm implemented
in dBug inspects an abstraction of this execution, searching for pairs of
transitions  that  are  both  concurrent  and  dependent.  A  pair  of  such  transitions  suggests 
that  the  transitions  could  execute  in  either  order  and  the  effects  of  the  transitions 
may  not  be  commutative,  possibly  affecting  the  outcome  of  the  test.  When  the  DPOR 
algorithm  encounters  a  pair  of  such  transitions,  the  exploration  frontier  of  the  DPOR 
algorithm  is  updated  to  make  sure  that  all  permutations  of  concurrent  and  dependent 
transitions  are  eventually  explored. 

Example 4. Consider again the C program depicted in Figure 3.1. The execution
tree depicted in Figure 3.6 illustrates the use of the DPOR algorithm to explore the
abstract state space depicted in Figure 3.2. Let us assume that the DPOR algorithm first
explores the execution corresponding to the leftmost branch of the execution tree. This
happens to be the same execution as depicted in Figure 3.5. In Example 3, we found
out that the transitions 3 and 4 are concurrent. Comparing their abstract footprints
reveals that they are also dependent. Consequently, the DPOR algorithm infers that
it is necessary to explore an execution that permutes these two transitions. Let us
assume that to that end, the DPOR algorithm explores the execution corresponding to
the middle branch of the execution tree depicted in Figure 3.6. The DPOR algorithm
then examines this execution and finds out that transitions 4 and 3 are concurrent and
dependent. Consequently, the DPOR algorithm infers that it is necessary to explore
an execution that permutes these two transitions. As this has already happened,
the DPOR algorithm finishes the exploration. Notably, the DPOR algorithm avoids
exploration of the rightmost branch of the execution tree depicted in Figure 3.6.

3.2.2  Design 

The  goal  of  dBug  is  to  enable  systematic  testing  of  multithreaded  and  distributed 
programs.  To  this  end,  dBug  needs  to  be  able  to  monitor  the  program  state,  to  control 
the  order  of  concurrent  program  transitions,  and  to  deterministically  replay  parts  of 
program  executions.  These  three  requirements  map  to  the  design  of  the  following 
dBug  components.  The  interposition  layer  is  responsible  for  monitoring  the  program 
state  and  controlling  the  execution  of  program  transitions.  The  arbiter  is  responsible  for 
maintaining  the  abstract  program  state,  identifying  what  transitions  can  be  executed 
next,  and  scheduling  these  transitions.  The  explorer  is  responsible  for  exploring  the  state 
space  of  abstract  program  states. 

Interposition  Layer 

The  interposition  layer  monitors  a  concrete  program  execution  for  events  that  affect 
the  abstract  state  maintained  by  the  arbiter.  When  a  thread  is  about  to  execute  such  an 
event,  the  interposition  layer  intercepts  this  event,  suspends  the  execution  of  the  thread, 
and  informs  the  arbiter  about  this  event.  At  some  later  point,  the  arbiter  instructs  the 
interposition  layer  to  resume  execution  of  the  suspended  thread. 

The  interposition  layer  receives  its  name  from  the  mechanism  intended  for  its 
deployment.  As  Figure  3.7  suggests,  an  interposition  layer  is  assumed  to  exist  between 
the  program  and  the  environment  in  which  the  program  runs,  providing  a  mechanism 
to  monitor  and  control  the  interactions  of  the  program  and  the  environment. 

Figure 3.7: Monitoring Program Behavior with dBug Interposition Layer

Arbiter

The arbiter uses a client-server architecture (Figure 3.8) to collect information about
the abstract state of the program through the events intercepted by the interposition
layer. To control scheduling nondeterminism, the arbiter waits until all threads cease
execution either by trapping into the interposition layer or simply terminating. The
arbiter then enumerates all transitions that are possible from the current abstract state
and selects one of them. The selected transition identifies one or more threads whose
execution is then resumed. This process is repeated until all threads terminate or a
stopping condition, such as reaching a deadlock, is encountered.

Explorer 

To  explore  the  space  of  possible  abstract  states  of  a  program  test,  the  explorer  repeatedly 
starts  an  instance  of  the  arbiter  and  the  program  test  with  the  interposition  layer  in  place 
and  waits  for  the  program  test  to  finish.  Once  the  program  test  finishes,  the  explorer 
collects  information  about  the  sequence  of  abstract  states  explored  by  the  arbiter.  This 
information  is  used  to  represent  the  state  space  as  an  execution  tree  and  to  compute  the 
happens-before  relation  and  the  dependence  relation  in  order  to  explore  the  execution 
tree  using  the  DPOR  algorithm.  Once  an  execution  is  processed  by  the  explorer,  an 
unexplored  node  on  the  exploration  frontier  is  identified  and  used  to  create  a  schedule 
for  the  arbiter  which  will  steer  the  next  execution  towards  unexplored  parts  of  the  state 
space.  Figure  3.9  illustrates  how,  over  time,  the  explorer  communicates  with  different 
instances  of  the  arbiter,  while  exploring  the  execution  tree. 

3.2.3  Implementation 

Interposition  Mechanism 

In the course of dBug's lifetime, several mechanisms for the implementation of the
interposition layer have been experimented with. For example, early dBug prototypes [136]
relied on manual source code annotations. This process was time-consuming,
error-prone, and required access to and understanding of the program source code.

To avoid these problems, the mechanism has since evolved to its current form [135,
137] that relies on runtime interposition on dynamically linked symbols. This mechanism
is well aligned with dBug's goal to offer systematic testing at different levels of
abstraction, which are provided by the function prototypes of the interposed functions.
Further, runtime interposition avoids the need for having access to the program source
code and works irrespective of the choice of programming language.

The main drawback of using runtime interposition is its inability to track memory
accesses, limiting the precision of program abstraction and analysis. Although tracking
of memory accesses is possible with alternative approaches to runtime interposition
such as dynamic instrumentation [98] and virtualization [11, 28], tracking memory
accesses has its own drawbacks. Namely, modeling concurrent program transitions at
the granularity of (shared) memory accesses exacerbates the combinatorial explosion of
the number of scenarios to explore. In addition to that, using technology for tracking
memory accesses imposes a runtime overhead on program execution and this overhead
reduces the rate at which systematic testing explores the state space. Given the extent
of the combinatorial explosion without considering memory accesses (see Chapters 6
and 7), dBug chooses to avoid tracking of memory accesses, leaving the job of searching
for low-level data races to tools such as Eraser [128], RaceTrack [169], and PACER [25].
The analysis carried out by these tools is orthogonal to systematic testing and thus can
be used in combination with systematic testing to increase its precision.

Figure 3.8: Controlling Scheduling Nondeterminism with dBug Arbiter

Figure 3.9: Exploring Execution Tree with dBug Explorer

Default Abstraction

The default abstraction dBug uses to model program execution is the POSIX interface
[119]. Using this abstraction has several advantages. It is shared by all programs
of POSIX-compliant operating systems and the libraries that implement the POSIX
interface typically use dynamic linking, enabling runtime interposition. Consequently,
using the POSIX interface as the default abstraction allows dBug to target a wide
range of programs without the need to access or modify their source code. In practice,
the applications of dBug reported in this thesis target the Linux operating system.

In  the  POSIX  abstraction,  the  global  state  keeps  track  of  the  following  objects: 
threads,  processes,  barriers,  condition  variables,  mutexes,  read-write  locks,  semaphores, 
spinlocks,  epoll  descriptors,  pipe  descriptors,  file  descriptors,  and  socket  descriptors. 
The  local  state  of  each  thread  consists  of  its  call  stack  and  dBug  keeps  track  of  the 
values  of  the  clock  function  using  a  vector  clock  [123].  A  detailed  list  of  POSIX  interface 
functions  that  dBug  interposes  on  can  be  found  in  Appendix  A. 


Exploration 

To explore the space of possible abstract program states, dBug controls the order in
which program threads execute concurrent program transitions. To this end, dBug
keeps track of all program threads, maintaining this information as part of the global
state and updating the global state when relevant POSIX interface invocations, such as
pthread_create() or fork(), are encountered. Knowing the set of all program threads
allows dBug to recognize when the program reaches a quiescent state, that is, a state
in which all program threads have either terminated or trapped into the dBug interposition
layer. When a quiescent state is reached, the program waits for the dBug arbiter to
make a scheduling decision.

The  dBug  arbiter  repeatedly  waits  for  the  program  to  reach  a  quiescent  state, 
enumerates  all  program  transitions  possible  from  that  state,  selects  one  of  the  program 
transitions,  and  resumes  execution  of  the  threads  that  participate  in  that  transition.  This 
process  is  repeated  until  the  program  terminates  or  a  stopping  condition,  such  as  a 
deadlock,  is  encountered.  Note  that  the  algorithm  that  the  dBug  arbiter  uses  to  schedule 
program  transitions  relies  on  the  absence  of  busy  waiting.  If  a  thread  can  run  indefinitely 
without  trapping  into  the  interposition  layer,  the  program  might  never  reach  a  quiescent 
state. To identify busy waiting, dBug uses a timeout and reports the backtrace of suspect
threads, effectively enforcing a programming discipline where threads yield the CPU
through functions such as sleep(), sched_yield(), or poll().
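The quiescence check that drives the arbiter loop can be sketched as follows. The enumeration `thread_status` and the functions `is_quiescent` and `is_deadlock` are hypothetical names, and treating a quiescent state with a trapped thread and no enabled transition as a deadlock is an assumption of this sketch rather than a statement about dBug's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Status of a program thread as seen by the arbiter (hypothetical). */
typedef enum {
  THREAD_RUNNING,   /* executing outside of the interposition layer */
  THREAD_TRAPPED,   /* suspended inside the interposition layer */
  THREAD_TERMINATED /* finished execution */
} thread_status;

/* A state is quiescent when no thread is running, that is, when every
   thread has either trapped into the interposition layer or terminated. */
bool is_quiescent(const thread_status *status, size_t n) {
  assert(status != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (status[i] == THREAD_RUNNING) return false;
  return true;
}

/* Assumed stopping condition: the state is quiescent, some thread is still
   waiting to be scheduled, and no transition is enabled. */
bool is_deadlock(const thread_status *status, size_t n,
                 size_t enabled_transitions) {
  if (!is_quiescent(status, n) || enabled_transitions > 0) return false;
  for (size_t i = 0; i < n; i++)
    if (status[i] == THREAD_TRAPPED) return true;
  return false;
}
```

A thread that busy waits would remain in the `THREAD_RUNNING` status indefinitely, which is why `is_quiescent` would never hold and a timeout is needed to detect the situation.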

Once the dBug arbiter reaches a terminal state, it sends a description of the sequence
of program transitions it explored and all the alternative scheduling choices it has
encountered to the dBug explorer. The dBug explorer then maps the explored sequence
of program transitions to a branch of the execution tree and marks it as explored.
The description of program transitions contains logical clock timestamps and abstract
footprints that the dBug explorer uses to compute the happens-before and dependence
relations over the set of program transitions. This information is in turn used by
dBug's implementation of the DPOR algorithm to update the exploration frontier of the
execution tree, identifying nodes of the exploration frontier that need to be explored to
guarantee that all permutations of concurrent and dependent program state transitions
are explored.

Next,  the  dBug  explorer  identifies  a  node  of  the  exploration  frontier  to  explore  next. 
This  node  represents  a  sequence  of  scheduling  choices  that  has  not  been  previously 
explored.  The  initial  program  state  is  then  recreated,  either  by  simply  starting  the 
program  anew  or  through  a  user-provided  initialization  function,  and  the  dBug  arbiter 
is  told  what  scheduling  choices  to  make  to  steer  the  execution  towards  the  previously 
identified  node  of  the  exploration  frontier. 

An  important  prerequisite  of  systematic  testing  is  that  the  identification  of  scheduling 
choices  needs  to  be  consistent  across  different  program  executions.  To  this  end,  dBug 
uses  deterministic  numbering  of  threads,  which  is  possible  because  dBug  controls  the 
order  in  which  threads  are  created.  Further,  dBug's  implementation  follows  a  strict 
programming  discipline  to  make  sure  that  identical  scheduling  choices  result  in  identical 
abstract  program  states  across  different  program  executions. 

The  programming  discipline  dictates  that  the  dBug  arbiter  models  each  event  it 
interposes  on  using  three  functions  -  preschedule,  test,  and  execute  -  that  cannot  block 
and  execute  atomically  with  respect  to  one  another.  The  following  paragraphs  detail 
these  functions  and  Figure  3.10  illustrates  the  typical  operation  of  the  dBug  arbiter. 

The preschedule function is used to update the global state when the corresponding
event is originally intercepted. For example, the preschedule function of the
pthread_barrier_wait(b) event updates the number of waiters of the barrier
b, while the preschedule function of the pthread_cond_wait(c, m) event updates the
state of the mutex m. The programming discipline requires that the effects different
preschedule functions have on the global state are commutative.

The test function is used to inspect the global state and to determine what program
transitions containing the corresponding event are possible. The test function
computes a set of program transitions, each containing its abstract footprint
and a seed that can be used to reproduce the nondeterministic choices made in the
course of creating this program transition. For example, the test function of the
pthread_barrier_wait(b) event checks if the actual number and target number
of waiters of the barrier b match and if so, generates n program transitions, one per
waiter, returning PTHREAD_BARRIER_SERIAL_THREAD from different waiters (see
pthread_barrier_wait() man pages for details). The programming discipline
requires that no test function modifies the global state.

The execute function is used to model the effect that executing the corresponding
event has on the global state. It inputs a program transition seed and uses it to reproduce
any nondeterministic choices in accordance with the choices made by the test function
that generated this program transition. When a program transition involves multiple
events, the programming discipline requires that their respective execute functions are
executed in a deterministic order.

Figure 3.10: Arbiter Execution Example

In summary, the programming discipline of the dBug arbiter makes sure that dBug
does not introduce any nondeterminism into the process of enumerating and executing
program transitions. Its key properties are that 1) different preschedule functions are
commutative, 2) no test function is allowed to modify the global state, and 3) execute
functions are executed in a deterministic order.
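The three-function discipline can be illustrated on the pthread_barrier_wait(b) event discussed above. The following C sketch is an assumption-laden model, not dBug's code: the type `barrier_model` and the functions `barrier_preschedule`, `barrier_test`, and `barrier_execute` are hypothetical, waiters are tracked in a bitmask so that registering them commutes, and the seed of a transition is taken to be the identifier of the waiter that would receive PTHREAD_BARRIER_SERIAL_THREAD.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_THREADS (sizeof(unsigned) * CHAR_BIT)

/* Hypothetical model of a barrier in the arbiter's global state. */
typedef struct {
  unsigned target;  /* number of waiters required to open the barrier */
  unsigned waiters; /* bit i set: thread i is waiting at the barrier */
} barrier_model;

static unsigned count_waiters(unsigned mask) {
  unsigned n = 0;
  for (; mask != 0; mask >>= 1) n += mask & 1u;
  return n;
}

/* preschedule: registers a waiter. Setting a bit is commutative, so the
   order in which waiters are intercepted does not affect the state. */
void barrier_preschedule(barrier_model *b, unsigned thread) {
  assert(thread < MAX_THREADS);
  b->waiters |= 1u << thread;
}

/* test: read-only. If the barrier is full, writes one seed per waiter
   (up to capacity) and returns the number of possible transitions. */
size_t barrier_test(const barrier_model *b, unsigned *seeds,
                    size_t capacity) {
  size_t n = 0;
  if (count_waiters(b->waiters) != b->target) return 0;
  for (unsigned t = 0; t < MAX_THREADS && n < capacity; t++)
    if ((b->waiters >> t) & 1u) seeds[n++] = t;
  return n;
}

/* execute: applies the transition identified by the seed, releasing all
   waiters; returns the thread modeled as the serial thread. */
unsigned barrier_execute(barrier_model *b, unsigned seed) {
  assert(seed < MAX_THREADS && ((b->waiters >> seed) & 1u));
  b->waiters = 0;
  return seed;
}
```

Because `barrier_test` takes a pointer to a constant model, the compiler helps enforce the rule that test functions do not modify the global state.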




Figure  3.11:  Direct  Record-Replay  Mechanism  Example 


Nondeterminism 

A separate concern that dBug needs to address in order to enable deterministic replay
is the input nondeterminism coming from the operating system. In general, a program
can input values through the functions dBug interposes on and these values can depend
on factors external to the program. For example, the value of the getpid() function,
which returns an integer identifier of a process, depends on the context in which the
program runs, while the value of the time() function, which returns the number of
seconds since January 1st, 1970, depends on the time of the function invocation. If the
program logic depends on these values, then different program executions that use the
same scheduling choices could eventually diverge. To address this problem, dBug uses
three different mechanisms: direct record-replay, indirect record-replay, and time travel.

The  direct  record-replay  mechanism  builds  on  previous  work  [55,  127]  that  records 
outcomes  of  system  calls  to  enable  deterministic  replay  of  program  execution.  The 
difference  in  the  context  of  dBug  is  that  a  program  transitions  from  a  replay  phase  to  a 
record  phase  in  the  course  of  its  execution.  This  transition  occurs  when  the  program 
execution  finishes  replaying  scheduling  choices  identified  by  the  dBug  explorer,  reaching 
the execution tree exploration frontier. This mechanism is used for functions, such
as the random() function or the read() function when reading from /dev/random,
that generate transient variable values.

Example 5. Figure 3.11 depicts an example of direct record-replay execution. Initially,
the execution tree exploration frontier maintained by dBug consists of the initial program
state represented by the node A. The underlying program is then executed and the
explored sequence of scheduling choices is modeled as the branch A-B-C-D. In the
course of the execution, two calls to the random() function are encountered and their
outcomes are recorded in the execution tree. The DPOR algorithm is then used to add
important alternative scheduling choices, such as the node E, to the execution tree
exploration frontier. A new program execution is then started, steering the execution
towards the node E, and the explored sequence of scheduling choices is modeled as the
branch A-B-E-F. In the course of this execution, two calls to the random() function
are encountered again. In contrast to the first execution, the outcome of the first call
is replayed using the outcome recorded in the first execution. The execution then
continues replaying scheduling choices identified by the dBug explorer until it reaches
the program state represented by the node B and then transitions from the replay phase
to the record phase. From that point, the execution continues to explore an arbitrary
sequence of scheduling choices. When the second call to the random() function is
encountered, its outcome is recorded.

Figure 3.12: Indirect Record-Replay Mechanism Example
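The replay-to-record transition described in Example 5 can be sketched as follows. This is a minimal illustration, not dBug's implementation: the class name `DirectRecordReplay` and its methods are hypothetical, and the sketch assumes that the outcomes recorded along the replayed prefix are simply consumed in call order.

```python
import random


class DirectRecordReplay:
    """Replays recorded outcomes first, then records fresh ones (hypothetical helper)."""

    def __init__(self, recorded=None):
        # Outcomes recorded by an earlier execution along the replayed prefix.
        self.recorded = list(recorded or [])
        self.position = 0   # index of the next outcome to replay
        self.log = []       # outcomes observed by this execution, in call order

    def call(self, generate):
        """Interpose on a nondeterministic call such as random()."""
        if self.position < len(self.recorded):
            value = self.recorded[self.position]   # replay phase
            self.position += 1
        else:
            value = generate()                     # record phase
        self.log.append(value)
        return value


# First execution (branch A-B-C-D): both outcomes are generated and recorded.
first = DirectRecordReplay()
first_values = [first.call(random.random), first.call(random.random)]

# Second execution (branch A-B-E-F): only the first call lies on the shared
# prefix A-B, so only its outcome is replayed; the second one is recorded anew.
second = DirectRecordReplay(recorded=first.log[:1])
second_values = [second.call(random.random), second.call(random.random)]
```

The second execution therefore observes the same first value as the first execution, while its second value is generated freshly once the exploration frontier is crossed.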

The direct record-replay mechanism may fail when the outcome of one interposed
function persists in the program environment and can later be used as an input of
another interposed function. The problematic scenario occurs when the outcome of
the first function call is replayed and then the second function call uses this outcome
to look up information in the environment. Since dBug does not control the state of
the environment, the lookup may fail. To address this problem, dBug uses the indirect
record-replay mechanism. This mechanism maintains an indirection map that translates
environment values to deterministically generated unique values, which are used for
the purpose of record-replay. When these deterministic values are used as an input to
an interposed function, dBug uses the indirection map to translate these values back
to their environmental counterparts. This mechanism is used for values that serve as
environmental handles, such as process identifiers of functions fork() and waitpid()
or socket descriptors of functions socket() and send().

Figure 3.13: Time Travel Mechanism Example

Example 6. Figure 3.12 depicts an example of indirect record-replay. Similar to
Example 5, this example explores two executions, replaying some scheduling choices in the
course of the second execution. The first execution follows a call to the fork() function
with a call to the waitpid() function. The true outcome of the fork() function,
process identifier 128, is assigned the deterministic value of 42, which is recorded
and returned to the caller. When the waitpid() function later uses 42 as an
input argument, the dBug interposition layer replaces this value with its environmental
counterpart 128. The second execution also follows a call to the fork() function with
a call to the waitpid() function. In contrast to the first execution, when the second
execution encounters the fork() function call, its true outcome, process identifier
255, is assigned the deterministic value of 42, replayed from the first execution. When
the waitpid() function uses 42 as an input argument, the dBug interposition layer
replaces this value with its environmental counterpart 255.
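The indirection map of Example 6 can be sketched as a small table that hands out deterministic stand-ins for environment handles. The class `IndirectionMap`, its method names, and the choice of 42 as the first deterministic value are assumptions made for illustration; the sketch does not reproduce dBug's actual data structures.

```python
class IndirectionMap:
    """Maps environment handles to deterministic stand-ins (hypothetical helper)."""

    def __init__(self, recorded=None, first_fresh=42):
        self.recorded = list(recorded or [])  # deterministic values from an earlier run
        self.position = 0                     # next recorded value to replay
        self.next_fresh = first_fresh         # 42 only mirrors Example 6
        self.to_env = {}                      # deterministic value -> environment value
        self.log = []                         # deterministic values issued in this run

    def on_create(self, env_value):
        """Interpose on a call that returns a handle, e.g. fork()."""
        if self.position < len(self.recorded):
            det = self.recorded[self.position]   # replay phase
            self.position += 1
        else:
            det = self.next_fresh                # record phase
        # Keep later fresh values distinct from every value issued so far.
        self.next_fresh = max(self.next_fresh, det + 1)
        self.to_env[det] = env_value
        self.log.append(det)
        return det

    def on_use(self, det_value):
        """Interpose on a call that consumes a handle, e.g. waitpid()."""
        return self.to_env[det_value]


# First execution: the environment returns 128, the program sees 42.
first = IndirectionMap()
pid_seen_by_program = first.on_create(128)
pid_passed_to_environment = first.on_use(pid_seen_by_program)

# Second execution: the environment returns 255, the replayed value is still 42.
second = IndirectionMap(recorded=first.log)
replayed_pid = second.on_create(255)
second_pid_passed_to_environment = second.on_use(replayed_pid)
```

The program under test only ever observes the deterministic value, so its behavior is the same in both executions even though the environment assigned different process identifiers.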

Although input nondeterminism stemming from functions that return absolute time
values can be handled by the direct record-replay mechanism, the use of the mechanism
can result in unlikely timing scenarios. For example, consider a program that uses a pair
of calls to the gettimeofday() function, one at the beginning of its execution and
one at the end of its execution, to measure its runtime. The use of direct record-replay
will result in the outcome of the first call in the first execution being replayed in all
of the subsequent executions, while the outcome of the second call will be generated
anew in every execution. Consequently, the runtime measurements will grow more and
more skewed with each explored execution. To address this problem, dBug uses the
time travel mechanism, which for each node of the execution tree keeps track of the time
the node was first encountered, referred to as the creation time. Similar to the previous
mechanisms, the time travel mechanism uses a replay phase and a record phase. The
replay phase is identical to the replay phase of direct record-replay and the transition
to the record phase occurs when the program execution finishes replaying scheduling
choices identified by the dBug explorer. When the transition happens, the creation time
of the execution tree node corresponding to the current program state is compared to
the current time and their difference is subtracted from outcomes of all subsequent calls
to absolute time functions. This mechanism is used for values of functions that return
absolute time values, such as the gettimeofday() function or the time() function.

Example 7. Figure 3.13 depicts an example of time travel. Similar to Examples 5 and 6,
this example explores two executions, replaying some scheduling choices in the course
of the second execution. In addition to depicting the execution tree, the figure also
depicts two dashed lines that record the progression of absolute time in the course of
the two executions. The first execution makes two calls to the time() function and
their outcomes are simply recorded and returned to the caller. The second execution also
makes two calls to the time() function. In contrast to the first execution, the outcome
of the first call is replayed using the recorded value, returning the value 5. When the
node B is encountered, the time travel mechanism recognizes that it is about to cross
the exploration frontier and computes the difference between 50, the current time, and
10, the first time the node B was encountered. This difference is then subtracted from
the outcome of the second call to the time() function, producing the value 15 instead of
the value 55. By doing so, the time travel mechanism tricks every execution into thinking
it started at the same time.
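The arithmetic of Example 7 can be reproduced with a short sketch. The class `TimeTravelClock` and the fake clock readings are hypothetical; the values 5, 10, 50, and 55 are taken from the example, everything else is an assumption made to keep the sketch self-contained.

```python
class TimeTravelClock:
    """Shifts absolute time readings after the exploration frontier (hypothetical helper)."""

    def __init__(self, recorded, creation_time, now):
        self.recorded = list(recorded)      # outcomes replayed before the frontier
        self.position = 0
        self.creation_time = creation_time  # time the frontier node was first encountered
        self.now = now                      # callable returning the current absolute time
        self.offset = None                  # set when the frontier is crossed

    def cross_frontier(self):
        """Transition from the replay phase to the record phase."""
        self.offset = self.now() - self.creation_time

    def time(self):
        """Interpose on an absolute time function such as time()."""
        if self.offset is None:
            value = self.recorded[self.position]   # replay phase
            self.position += 1
            return value
        return self.now() - self.offset            # record phase


# Second execution of Example 7: the real clock reads 50 at node B and 55 later.
readings = iter([50, 55])
clock = TimeTravelClock(recorded=[5], creation_time=10, now=lambda: next(readings))

first_call = clock.time()    # replayed from the first execution
clock.cross_frontier()       # offset = 50 - 10
second_call = clock.time()   # 55 - offset
```

With these inputs the first call yields 5, the offset is 40, and the second call yields 15, matching the values in the example.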

Interface  Modeling 

To  extend  dBug  to  support  interposition  of  new  events,  for  example  when  adding 
support  for  a  new  level  of  abstraction  (cf.  Chapter  7),  several  steps  are  necessary.  First, 
one  needs  to  implement  wrappers  that  allow  the  dBug  interposition  layer  to  intercept 
these  events  when  they  are  about  to  occur.  Second,  one  needs  to  extend  the  RPC 
interface  of  the  dBug  arbiter  with  methods  that  control  scheduling  of  the  intercepted 
events.  When  an  event  is  intercepted,  the  dBug  interposition  layer  issues  a  blocking  RPC 
to  the  dBug  arbiter  containing  information  about  the  intercepted  event.  The  RPC  returns 
when  the  dBug  arbiter  decides  to  schedule  the  intercepted  event  for  execution.  Third, 
one  needs  to  extend  the  dBug  arbiter  with  the  preschedule,  test,  and  execute  functions 
that  model  the  semantics  of  the  intercepted  events. 

By and large, extending dBug with support for one event requires over one hundred
lines of new code spread across different source files. Unsurprisingly, this process is
time consuming, limiting the rate at which support for new events can be added and
support for existing events can be modified. To address this problem, dBug uses a
custom code generator to generate the source code of the dBug interposition layer wrappers
and the dBug arbiter RPC interface automatically from the prototypes of the interposed
events. The RPC interface implementation is in turn generated from the dBug arbiter
RPC interface using Apache Thrift [5].

    <type>      ::= list<type> | <primitive>
    <primitive> ::= boolean | integer | string | event

Figure 3.14: Type Grammar of dBug Object Models

In  addition,  dBug  offers  a  modeling  language  that  can  be  used  to  describe  the  objects 
maintained  as  part  of  the  global  state  and  the  effect  interposed  events  have  on  the  global 
state.  This  modeling  language  comes  with  a  compiler  that  uses  the  object  and  event 
models  written  in  this  language  to  automatically  generate  parts  of  the  dBug  arbiter 
source  code,  including  the  preschedule,  test,  and  execute  functions. 

The  objects  are  modeled  as  records  consisting  of  fields  that  have  a  name  and  a  type. 
The  types  recognized  by  dBug  are  summarized  by  the  grammar  presented  in  Figure  3.14. 
The  grammar  defines  four  primitive  types:  boolean,  integer,  string,  and  event. 
The  first  three  represent  the  standard  data  types,  while  the  event  type  represents  event 
identifiers.  In  addition  to  the  primitive  types,  the  grammar  defines  the  list<type> 
type  that  represents  lists  of  typed  elements. 

The  events  are  modeled  as  state  machines  that  input  typed  values  and  always  execute 
two  transitions  on  their  way  from  their  initial  state  to  their  final  state.  The  first  transition 
models  the  effect  of  the  preschedule  function,  while  the  second  transition  models  the 
effect  of  the  execute  function.  Each  transition  is  labeled  by  a  guard  that  identifies  the 
global  states  in  which  this  transition  is  enabled  and  an  action  that  describes  the  effect 
this  transition  has  on  the  global  state.  Further,  the  second  transition  is  also  labeled  with 
the  return  value  to  use  for  the  event. 

The guards and actions recognized by dBug are summarized by the grammars
presented in Figures 3.15 and 3.16 respectively. The object terminal represents
different types of objects defined by the object models. The id terminal represents
identifiers of the state machine input values. The field terminal represents the different
object fields defined by the object models. The integer terminal represents an integer.
The event terminal represents an identifier of this event. The tid terminal represents
an identifier of the thread that intercepted this event. The exists(object(id))
guard checks if the object identified by the given object type and identifier exists. The
empty(object(id).field) guard checks if the list identified by the given object
type, identifier, and field is empty. The contains(object(id).field, <value>)
guard checks if the list identified by the given object type, identifier, and field contains
the given value. The length(object(id).field) value equals the length of the list
identified by the given object type, identifier, and field. The create(object(<ids>))
action creates a new object given the object type and a list of constructor arguments.
The delete(object(id)) action deletes an existing object identified by the given
object type and identifier. The enqueue(object(id).field, <value>) action
adds the given value to the front of the list identified by the given object type, identifier,
and field. The remove(object(id).field, <value>) action removes all elements
whose value matches the given value from the list identified by the given object
type, identifier, and field. The dequeue(object(id).field) action removes the
last element of the list identified by the given object type, identifier, and field. The
empty(object(id).field) action removes all elements of the list identified by
the given object type, identifier, and field. The warning(string) action generates a
warning message using the given string.

    <guard>       ::= <guard> and <proposition> | <proposition>
    <proposition> ::= not(<proposition>)
                    | exists(object(id))
                    | empty(object(id).field)
                    | contains(object(id).field, <value>)
                    | <expression> == <expression>
                    | <expression> > <expression>
                    | <expression> < <expression>
                    | <expression>
    <expression>  ::= <expression> + <value>
                    | <expression> - <value>
                    | <value>
    <value>       ::= length(object(id).field)
                    | object(id).field
                    | id
                    | integer
                    | event
                    | tid

Figure 3.15: Guard Grammar of dBug Event Models

To illustrate the modeling language in action, Figures 3.17, 3.18, 3.19, 3.20, and 3.21
depict the event models for the events of the pthreads spinlock interface. Note that
these figures are generated automatically from the event models using our compiler.

In addition to the pthreads spinlock interface, the modeling language was used to
model the pthreads barrier, condition variable, mutex, and read-write lock interfaces.
The object and event models resulting from this effort, along with the prototypes of the
events dBug intercepts, amount to 3,229 lines of data. This data is used to automatically
generate 48,761 out of 73,359 lines (over 66%) of dBug's source code.

    <action>     ::= <statements>
    <statements> ::= ...
                   | delete(object(id))
                   | ...
                   | empty(object(id).field)
                   | warning(string)
                   | object(id).field = <expression>
    <expression> ::= <expression> + <value>
                   | <expression> - <value>
                   | ...

Figure 3.16: Action Grammar of dBug Event Models


Figure  3.17:  Event  Model  of  pthread_spin_destroy  (id) 

State  Space  Estimation 

Similar to ETA, dBug recognizes the importance of estimating test complexity and
implements a technique that estimates the number of thread interleavings that dBug
examines in the course of a systematic test. In particular, to estimate the size of a
partially explored execution tree, dBug uses a technique based on the weighted backtrack
estimator [80] (cf. Chapter 4). The technique treats the set of explored branches as
a sample of the entire execution tree, assuming uniform distribution over edges, and
computes the estimate as the number of explored branches divided by the aggregated
probability the branches are explored by a random exploration.

guard:  exists(Spin(id))
action: warning('Spin lock already exists.');

guard:
action:


Figure  3.18:  Event  Model  of  pthread_spin_init  (id) 


guard:  not(exists(Spin(id)))
action: warning('Spin lock does not exist.');

guard:  exists(Spin(id)) and Spin(id).state == 1 and Spin(id).owner == tid
action: warning('Spin lock already held.');

guard:
action:

guard:  exists(Spin(id)) and Spin(id).state == 0
action: Spin(id).state = 1; Spin(id).owner = tid;
return: 0


Figure  3.19:  Event  Model  of  pthread_spin_lock  (id) 


guard:  not(exists(Spin(id)))
action: warning('Spin lock does not exist.');

guard:  exists(Spin(id)) and Spin(id).state == 1 and Spin(id).owner == tid
action: warning('Spin lock already held.');

guard:
action:


Figure  3.20:  Event  Model  of  pthread_spin_trylock  (id) 


guard:  not(exists(Spin(id)))
action: warning('Spin lock does not exist.');

guard:  Spin(id).state == 0
action: warning('Spin lock is not locked.');

guard:  not(Spin(id).owner == tid)
action: warning('Spin lock is not owned by the caller.');

guard:
action:

guard:  exists(Spin(id)) and Spin(id).state == 1
action: Spin(id).state = 0; Spin(id).owner = 0;
return: 0


Figure  3.21:  Event  Model  of  pthread_spin_unlock (id) 
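To make the guard and action labels of the spin lock event models more concrete, the following sketch renders the labels of pthread_spin_lock and pthread_spin_unlock as plain Python functions over a dictionary of Spin objects. This is a hypothetical rendering, not the code generated by the dBug compiler: the function names are invented, and the assignment of the warning labels to a step that runs before scheduling is an assumption.

```python
def lock_warnings(spins, lock_id, tid):
    """Warnings attached to pthread_spin_lock(id), checked before scheduling (assumed)."""
    if lock_id not in spins:
        return ["Spin lock does not exist."]
    spin = spins[lock_id]
    if spin["state"] == 1 and spin["owner"] == tid:
        return ["Spin lock already held."]
    return []


def lock_enabled(spins, lock_id):
    """Guard: exists(Spin(id)) and Spin(id).state == 0."""
    return lock_id in spins and spins[lock_id]["state"] == 0


def lock_execute(spins, lock_id, tid):
    """Action: Spin(id).state = 1; Spin(id).owner = tid; return: 0."""
    spins[lock_id]["state"] = 1
    spins[lock_id]["owner"] = tid
    return 0


def unlock_enabled(spins, lock_id):
    """Guard: exists(Spin(id)) and Spin(id).state == 1."""
    return lock_id in spins and spins[lock_id]["state"] == 1


def unlock_execute(spins, lock_id):
    """Action: Spin(id).state = 0; Spin(id).owner = 0; return: 0."""
    spins[lock_id]["state"] = 0
    spins[lock_id]["owner"] = 0
    return 0


# Global state with one unlocked Spin object whose identifier is 7.
spins = {7: {"state": 0, "owner": 0}}

acquired = lock_enabled(spins, 7) and lock_execute(spins, 7, tid=1) == 0
blocked = not lock_enabled(spins, 7)             # another locker is not enabled
relock_warning = lock_warnings(spins, 7, tid=1)  # the owner tries to lock again
released = unlock_enabled(spins, 7) and unlock_execute(spins, 7) == 0
```

Separating the guard from the action mirrors the role of the test and execute functions: an event whose guard does not hold is simply not eligible for scheduling until the global state changes.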
Chapter 4

State  Space  Estimation 


Over the course of the past 15 years, research advances in systematic testing of concurrent
programs [61, 82, 107, 160, 163, 165] have made the approach practical and easy
to  adopt.  As  systematic  testing  moves  into  wider  practice,  becoming  a  part  of  testing 
infrastructures  of  large-scale  system  developers  [107,  138],  new  practical  challenges  are 
emerging.  For  example,  when  resources  available  for  systematic  testing  are  scarce,  one 
needs  to  solve  the  problem  of  efficient  allocation  of  resources  to  a  collection  of  tests. 

Real-world  test  suites,  such  as  the  one  targeted  by  ETA  [138],  consist  of  hundreds  of 
tests  of  varied  complexity.  In  the  context  of  systematic  testing  of  concurrent  programs, 
it  is  reasonable  to  assume  that  the  resources  available  for  running  these  tests  are  not 
always  sufficient  to  complete  all  tests  in  the  time  allotted  for  testing.  In  such  cases, 
high-level  testing  objectives,  such  as  maximizing  the  number  of  completed  tests  or  achieving 
even coverage across tests, can be used to drive allocation of testing resources. Mapping
these  high-level  testing  objectives  into  working  allocation  mechanisms  is  an  important, 
practical,  and  yet  unaddressed  problem. 

This  chapter  proposes  a  solution  to  this  problem  based  on  estimation  of  the  length 
of  the  state  space  exploration  carried  out  by  systematic  tests.  As  discussed  in  Chapter  2, 
most  systematic  testing  implementations  [61,  82,  137,  160]  use  stateless  exploration, 
recording  the  different  executions  encountered  during  a  test  in  an  execution  tree.  Under 
this  abstraction,  the  problem  of  test  length  estimation  can  be  formulated  as  the  problem 
of  estimating  the  length  of  execution  tree  exploration.  Besides  offering  a  measure  of  test 
complexity, the estimates can also be used to implement resource allocation policies.

The  estimation  techniques  presented  in  this  chapter  can  be  characterized  as  online 
-  updating  the  estimate  as  the  exploration  makes  progress  through  the  state  space  - 
and  passive  -  not  mandating  a  particular  order  in  which  the  exploration  proceeds.  The 
benefit  of  online  estimation  is  that  the  estimate  can  be  refined  as  new  information  about 
the  state  space  is  gathered,  while  the  benefit  of  passive  estimation  is  that  it  can  be 
combined  with  any  exploration  strategy.  Furthermore,  passive  estimation  techniques 
can  be  evaluated  using  exploration  traces.  To  enable  verification  of  the  results  presented 
in  this  chapter,  our  evaluation  uses  a  publicly  available  collection  of  exploration  traces 
from  a  real-world  deployment  at  Google  [140]. 




This  chapter  makes  the  following  contributions.  First,  building  on  research  on  search 
tree  size  estimation  [80],  this  chapter  presents  techniques  for  estimating  the  length 
of  a  systematic  test.  Second,  this  chapter  demonstrates  the  practicality  of  test  length 
estimation  by  using  it  to  implement  several  resource  allocation  policies.  Third,  this 
chapter  uses  a  collection  of  exploration  traces  from  a  real-world  deployment  to  evaluate 
1)  the  accuracy  of  the  presented  estimation  techniques  and  2)  the  efficiency  of  the 
presented  resource  allocation  policies. 

The  rest  of  this  chapter  is  organized  as  follows.  Section  4.1  describes  the  syntax  and 
semantics  of  exploration  traces.  Section  4.2  presents  1)  techniques  for  estimating  the 
length  of  systematic  tests  and  2)  policies  for  resource  allocation  based  on  test  length 
estimates.  Section  4.3  describes  a  collection  of  exploration  traces  [140]  and  uses  these 
traces  to  evaluate  the  accuracy  of  the  presented  estimation  techniques  and  the  efficiency 
of  the  presented  resource  allocation  policies.  Section  4.4  discusses  related  work  and 
Section  4.5  draws  conclusions. 


4.1  Background 

The  estimation  techniques  presented  in  this  chapter  are  passive,  not  mandating  a 
particular  order  in  which  the  exploration  proceeds.  Consequently,  the  problem  of 
estimation  of  the  length  of  a  systematic  test  can  be  described  using  the  abstraction  of  an 
exploration  trace,  which  identifies  events  pertinent  to  the  estimation. 

An  exploration  trace  is  a  sequence  of  events,  where  an  event  is  one  of  the  following: 

1.  AddNode  x  y  -  A  node  x  with  parent  y  is  added  to  the  execution  tree. 

2.  Explore  x  -  Node  x  is  scheduled  for  exploration. 

3.  Transition  x  -  Exploration  transitions  to  node  x. 

4.  Start  -  Exploration  of  a  new  execution  is  started  from  the  root  node. 

5.  End  t  -  Exploration  of  the  current  execution  has  finished  after  t  time  units. 

Example  8.  Figure  4.1  depicts  an  execution  tree  and  Figure  4.2  depicts  a  trace  of  its 
exploration.  Initially,  the  root  node  0  is  added  and  scheduled  for  exploration.  Next, 
an  execution  is  started  from  the  root  node.  The  children  1,  2,  and  3  of  the  root  node 
are  added  and  the  child  1  is  scheduled  for  exploration.  The  execution  then  transitions 
to node 1. The children 4 and 5 of node 1 are added and the child 4 is scheduled for
exploration.  The  execution  transitions  to  node  4.  The  child  6  of  node  4  is  added  and 
scheduled  for  exploration.  The  execution  transitions  to  node  6.  For  the  sake  of  this 
example,  let  us  assume  that  the  exploration  then  infers  that  node  3  needs  to  be  explored 
and  schedules  it  for  exploration.  This  concludes  exploration  of  the  current  execution, 
requiring  a  total  of,  for  example,  0.42  time  units.  Next,  a  new  execution  is  started  from 
the  root  node.  The  second  execution  steers  towards  node  3  and  explores  a  branch  in  its 
subtree,  requiring  a  total  of,  for  example,  0.29  time  units,  concluding  the  exploration. 
Note  that  nodes  2,  5,  and  8  are  never  explored  because  the  exploration  inferred  they 
will  not  produce  new  interleavings  of  concurrent  and  dependent  events. 
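A sequential scan over an exploration trace can be sketched as follows. The tuple encoding of events, the use of `None` as the parent of the root node, and the function name `replay_trace` are assumptions made for illustration. The trace below covers only the first execution of Example 8, which the text describes in full.

```python
def replay_trace(trace):
    """Scans a trace and tracks which nodes are known, scheduled, and visited."""
    parent = {}        # node -> parent node
    scheduled = set()  # nodes scheduled for exploration
    visited = set()    # nodes the exploration has transitioned to
    branches = []      # (path, time) for every finished execution
    path = []
    root = None
    for event in trace:
        kind = event[0]
        if kind == "AddNode":
            _, node, node_parent = event
            parent[node] = node_parent
            if node_parent is None:
                root = node
        elif kind == "Explore":
            scheduled.add(event[1])
        elif kind == "Start":
            path = [root]
            visited.add(root)
        elif kind == "Transition":
            path.append(event[1])
            visited.add(event[1])
        elif kind == "End":
            branches.append((path, event[1]))
    return parent, scheduled, visited, branches


# First execution of Example 8, encoded as tuples (an assumed encoding).
trace = [
    ("AddNode", 0, None), ("Explore", 0), ("Start",),
    ("AddNode", 1, 0), ("AddNode", 2, 0), ("AddNode", 3, 0),
    ("Explore", 1), ("Transition", 1),
    ("AddNode", 4, 1), ("AddNode", 5, 1),
    ("Explore", 4), ("Transition", 4),
    ("AddNode", 6, 4), ("Explore", 6), ("Transition", 6),
    ("Explore", 3), ("End", 0.42),
]

parent, scheduled, visited, branches = replay_trace(trace)
pending = scheduled - visited            # scheduled but not yet explored
never_scheduled = set(parent) - scheduled
```

After the scan, node 3 is the only node that is scheduled but not yet explored, while nodes 2 and 5 are known but not scheduled, which is the information the estimation techniques need.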
Figure 4.1: Execution Tree Example

Figure 4.2: Exploration Trace Example


4.2  Methods 

This  section  presents  a  number  of  techniques  for  estimating  the  length  of  an  exploration 
carried  out  by  the  dynamic  partial  order  reduction  (DPOR)  [51]  algorithm.  Notably, 
each  execution  explored  by  the  DPOR  algorithm  starts  from  the  initial  state  of  the 
program  (root  node  of  the  execution  tree)  and  ends  in  some  final  state  of  the  program 
(a  leaf  node  of  the  execution  tree).  An  estimation  technique  can  thus  be  described  as  an 
algorithm  that  operates  over  an  exploration  trace  described  in  Section  4.1.  In  particular, 
a  sequential  scan  of  an  exploration  trace  can  be  used  to  simulate  gradual  exploration  of 
the  execution  tree,  maintaining  the  exploration  status  of  each  node. 

The presentation of the estimation techniques is divided into three logical components.
First, the strategy component is used to determine how to treat nodes of the
execution tree that have not been scheduled for exploration yet. Second, the estimator
component is used to determine how to combine a strategy and the exploration information
gathered so far to compute an intermediate estimate. Third, the fit component is
used  to  aggregate  the  sequence  of  intermediate  estimates  computed  so  far  to  produce 
the  final  estimate. 


4.2.1  Strategies 

In  general,  the  DPOR  algorithm  does  not  explore  all  subtrees  of  an  execution  tree 
because  it  may  identify  exploration  of  some  parts  of  the  execution  tree  as  redundant. 
Further,  the  information  about  which  subtrees  need  to  be  explored  is  revealed  only  as 
the exploration makes its way through the state space. The first step towards estimating
the length of an exploration is thus to determine what portions of a partially explored
execution tree will eventually be explored.

To this end, this chapter considers three strategies for computing the function
$F : V \to \mathcal{P}(V)$, which, for a node $v$ of a partially explored execution tree, estimates the set
of children of node $v$ that the strategy expects to schedule for exploration in the future.
Consequently,  at  any  point  of  the  exploration,  the  function  F  can  be  used  to  identify  the 
subset  of  nodes  of  a  partially  explored  execution  tree  that  the  strategy  believes  will  be 
eventually  explored. 

1.  Oracle  -  Assumes  perfect  knowledge  about  which  nodes  will  be  explored.  The 
function  F  is  computed  by  pre-processing  the  exploration  trace.  This  strategy  is 
infeasible  in  practice  but  this  chapter  considers  it  for  comparison  purposes. 

2.  Lazy - Assumes that a node will not be explored unless it has already been
scheduled for exploration. In other words, $F(v) = \emptyset$ for all nodes $v$.

3.  Eager - Assumes that a node will be explored unless the exploration has finished
exploring its parent without scheduling the node for exploration. In other words,
$F(v)$ is equal to the set of children of node $v$ that are yet to be scheduled for
exploration if the exploration of the subtree of $v$ has not finished and $\emptyset$ otherwise.
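The lazy and eager strategies can be sketched as two small functions. The argument names (`children`, `scheduled`, `finished`) are assumptions introduced for this sketch; the oracle strategy is omitted because it requires pre-processing the entire exploration trace.

```python
def lazy(children, scheduled, finished):
    """Lazy strategy: no child is expected to be scheduled in the future."""
    return set()


def eager(children, scheduled, finished):
    """Eager strategy: every unscheduled child is expected to be scheduled,
    unless the exploration of the subtree of the node has already finished."""
    if finished:
        return set()
    return set(children) - set(scheduled)


# Root node of Example 8 after the first execution: children 1, 2, and 3 are
# known, children 1 and 3 have been scheduled, and the subtree is not finished.
root_children = {1, 2, 3}
root_scheduled = {1, 3}

lazy_guess = lazy(root_children, root_scheduled, finished=False)
eager_guess = eager(root_children, root_scheduled, finished=False)
eager_done = eager(root_children, root_scheduled, finished=True)
```

For this node the lazy strategy predicts no further children, while the eager strategy predicts that node 2 will still be scheduled, which overestimates the work here because node 2 is never explored in Example 8.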

Space  and  Time  Complexity 

The  oracle  strategy,  which  is  used  for  benchmarking  purposes  only,  requires  linear 
pre-processing  time  and  has  O(n)  space  overhead,  where  n  is  the  size  of  the  execution 
tree.  The  lazy  and  the  eager  strategies  have  no  overhead. 

4.2.2  Estimators 

After  a  strategy  is  used  to  determine  which  parts  of  a  partially  explored  execution  tree 
will  be  explored,  this  information  is  passed  onto  an  estimator,  responsible  for  producing 
an  intermediate  estimate.  This  chapter  considers  two  different  estimators  based  on 
previous  work  [80]:  the  weighted  backtrack  estimator  and  the  recursive  estimator. 

Weighted  Backtrack  Estimator 

The weighted backtrack estimator (WBE) is an online variant of Knuth's offline
technique [85] for tree size estimation. WBE uses the length of each explored branch
weighted by the probability it is explored by a random exploration (assuming uniform
distribution over edges) to predict the size of the tree. To adapt WBE to exploration
length estimation, the length of each branch is replaced with the time used to explore it.
Formally, WBE updates its estimate every time the DPOR algorithm explores a branch,
setting the estimate to:

\[
\mathit{estimate} = \frac{\sum_{b \in B} t(b)}{\sum_{b \in B} p(b)},
\]

where $B$ is the set of explored branches, $t(b)$ is the time used to explore a branch $b$, and
$p(b)$ is the probability of exploring the branch, presented next.

For a branch $b = (v_1, \ldots, v_n)$, where $v_i$ is the $i$-th node along the branch $b$, the
probability $p(b)$ is defined as:

\[
p(b) = \prod_{i=1}^{n-1} \frac{1}{|E(v_i)| + |S(v_i)| + |F(v_i)|},
\]

where $E(v_i)$ is the set of explored children of node $v_i$, $S(v_i)$ is the set of children of node
$v_i$ scheduled for exploration, and $F(v_i)$ is determined by the strategy (cf. Section 4.2.1).
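The WBE formulas above can be evaluated directly, as in the following sketch. It assumes that each explored branch is summarized by its exploration time and by the triple (|E|, |S|, |F|) for each of its non-leaf nodes, and that the three sets are disjoint; the function names are hypothetical.

```python
def branch_probability(counts):
    """Probability p(b): product of 1 / (|E| + |S| + |F|) over the non-leaf nodes."""
    probability = 1.0
    for explored, scheduled, future in counts:
        probability /= explored + scheduled + future
    return probability


def weighted_backtrack_estimate(branches):
    """WBE: total exploration time divided by total branch probability."""
    total_time = sum(time for time, _ in branches)
    total_probability = sum(branch_probability(counts) for _, counts in branches)
    return total_time / total_probability


# One explored branch of depth two that took 1.0 time units. At both non-leaf
# nodes, one child is explored and the strategy predicts one more child.
one_branch = [(1.0, [(1, 0, 1), (1, 0, 1)])]
estimate_one = weighted_backtrack_estimate(one_branch)

# After the sibling leaf is explored as well, the lower node has two explored
# children, and both branches have probability 1/4.
two_branches = [
    (1.0, [(1, 0, 1), (2, 0, 0)]),
    (1.0, [(1, 0, 1), (2, 0, 0)]),
]
estimate_two = weighted_backtrack_estimate(two_branches)
```

In this uniform toy tree both estimates equal 4.0 time units, the time needed to explore all four leaves of a complete binary tree of depth two at 1.0 time units per branch.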


Recursive  Estimator 

The recursive estimator (RE) is an online technique that estimates the size of an
unexplored subtree using the arithmetic mean of the estimated sizes of its (partially)
explored siblings. To adapt RE to exploration length estimation, the size of each subtree
is replaced with the time used to explore it. Formally, when the DPOR algorithm
explores a branch $b = (v_1, \ldots, v_n)$, RE updates the exploration length estimate for every
node along the branch $b$ in a bottom-up manner, setting the estimate to:

\[
\mathit{estimate}(v_i) =
\begin{cases}
\left(1 + \dfrac{|S(v_i)| + |F(v_i)|}{|E(v_i)|}\right) \times \sum\limits_{v \in E(v_i)} \mathit{estimate}(v) & \text{if } i \neq n \\
t(b) & \text{if } i = n
\end{cases}
\]

where $E(v_i)$, $S(v_i)$, and $F(v_i)$ have the same meaning as above and $t(b)$ is the time used
to explore the branch $b$.
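The bottom-up update of RE can be sketched as follows. The sketch assumes the reading of the formula in which the sum of the estimates of the explored children is scaled by one plus the ratio of pending children to explored children, so that each pending child is charged the mean estimate of its explored siblings. The dictionaries and the function name `recursive_update` are hypothetical.

```python
def recursive_update(branch, branch_time, explored, scheduled, future, estimate):
    """Updates RE estimates bottom-up along a freshly explored branch."""
    estimate[branch[-1]] = branch_time            # leaf: t(b)
    for node in reversed(branch[:-1]):
        children = explored[node]
        explored_total = sum(estimate[child] for child in children)
        pending = len(scheduled[node]) + len(future[node])
        # Each pending child is charged the mean estimate of the explored ones.
        estimate[node] = (1 + pending / len(children)) * explored_total
    return estimate[branch[0]]


# First execution of Example 8: branch 0-1-4-6 explored in 0.42 time units,
# node 3 scheduled under the root, and the lazy strategy (F is always empty).
branch = [0, 1, 4, 6]
explored = {0: {1}, 1: {4}, 4: {6}}
scheduled = {0: {3}, 1: set(), 4: set()}
future = {0: set(), 1: set(), 4: set()}
estimate = {}

root_estimate = recursive_update(branch, 0.42, explored, scheduled, future, estimate)
```

Under these assumptions the estimate for the root is twice the time of the explored branch, because the scheduled subtree rooted at node 3 is assumed to take as long as its only explored sibling.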


Space  and  Time  Complexity 

The WBE estimate needs to be updated upon two events: 1) when a new branch is
explored and 2) when a node is scheduled for exploration. To avoid recomputation of
all of the values $p(b)$ and $t(b)$ for each update of the estimate, one can store, for each
node $v$ of the execution tree, the sums:

\[
\sum_{b \in B(v)} p(b) \quad \text{and} \quad \sum_{b \in B(v)} t(b),
\]

where $B(v)$ is the set of explored branches that contain the node $v$.

When  a  new  branch  is  explored,  the  aggregate  probability  and  time  values  make  it 
possible  to  update  the  WBE  estimate  by  updating  only  the  values  for  the  nodes  along  the 
current  branch.  Note  that  although  the  time  complexity  of  updating  the  WBE  estimate 
for  a  new  branch  is  linear  in  the  depth  of  the  execution  tree,  this  time  complexity  is 
amortized  over  the  time  used  to  explore  the  branch  to  a  constant. 

When a new node is scheduled for exploration, the aggregate probability and time
values of the nodes along the exploration frontier need to be updated. The time
complexity of this operation is O(d), where d is the depth of the execution tree. Unlike
exploring a new branch, the cost associated with updating the estimate when a node is
scheduled for exploration does not have constant amortized complexity. In practice, the
worst case scenario is rare and our evaluation did not experience significant overhead.

In contrast to the WBE estimate, the RE estimate needs to be updated only when a
new branch is explored. The time complexity of this operation is linear in the depth of
the execution tree but amortizes to O(1) over the time used to explore the branch.

4.2.3  Fits 

As the exploration progresses, new and presumably more accurate intermediate estimates
are computed. Previous work [80] considers the intermediate estimates produced
at distinct time points in isolation, using only the latest estimate for decision making.
In comparison, this chapter treats the intermediate estimates as a progression, fitting it
with a function f(t) and using the solution of the equation f(t) = t as the final estimate.

Both intuition and experience suggest that the exploration length estimates can
be initially inaccurate but converge to the correct value over the course of an
exploration. Consequently, this chapter uses a method to interpolate the progression of
estimates over time. The interpolation method is based on the Marquardt-Levenberg
algorithm [90, 100] for weighted non-linear least-squares fitting. The algorithm is used to
find the values for coefficients of the function that best fits the sequence of intermediate
estimates. To reflect the increasing confidence in estimates over time, the least-squares
fitting is weighted, using t as the weight for an estimate at time t. This chapter considers
four different fitting functions:

1.  Empty  function:  This  scheme  does  no  fitting.  Instead,  it  emulates  previous 
work  [80],  using  the  latest  intermediate  estimate  as  the  final  estimate. 

2. Constant function: f(t) = c. The advantage of using a constant function is that the
final estimate computed by solving the equation f(t) = t is guaranteed to be a
positive number. The disadvantage of using a constant function is that it does not
detect trends.

3. Linear function: f(t) = a * t + b. The advantage of using a linear function is its
ability to detect linear trends in the sequence of intermediate estimates. However,
in pathological cases, the final estimate computed by solving the equation f(t) = t
might be a negative number.

4. Logarithmic function: f(t) = a * ln(t) + b. The advantage of using a logarithmic
function is its ability to detect non-linear trends in the sequence of intermediate
estimates. However, in pathological cases, the equation f(t) = t might have no
solution, preventing computation of the final estimate.
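
The fits listed above can be illustrated with a short sketch. It makes two assumptions that are not taken from the text. First, because all three templates are linear in their coefficients, the sketch uses closed-form weighted least squares instead of the Marquardt-Levenberg algorithm. Second, when the logarithmic equation f(t) = t has two solutions, the sketch returns the larger one. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def weighted_line_fit(xs: List[float], ys: List[float],
                      ws: List[float]) -> Tuple[float, float]:
    """Return (a, b) minimizing sum(w * (y - (a * x + b)) ** 2)."""
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    a = (sw * sxy - sx * sy) / denom
    return a, (sy - a * sx) / sw


def constant_estimate(ts: List[float], es: List[float]) -> float:
    """Fit f(t) = c with weights t; the solution of f(t) = t is c itself."""
    return sum(t * e for t, e in zip(ts, es)) / sum(ts)


def linear_estimate(ts: List[float], es: List[float]) -> Optional[float]:
    """Fit f(t) = a * t + b with weights t and solve a * t + b = t."""
    a, b = weighted_line_fit(ts, es, ts)
    if a == 1.0:
        return None  # the fitted line is parallel to the diagonal
    t = b / (1.0 - a)
    return t if t > 0 else None  # pathological case: no positive solution


def logarithmic_estimate(ts: List[float], es: List[float]) -> Optional[float]:
    """Fit f(t) = a * ln(t) + b with weights t and solve f(t) = t by bisection."""
    a, b = weighted_line_fit([math.log(t) for t in ts], es, ts)

    def g(t: float) -> float:
        return a * math.log(t) + b - t

    # For a > 0, g peaks at t = a and decreases afterwards, so the larger
    # root (an assumption of this sketch) lies to the right of a.
    lo = a if a > 0 else 1e-12
    if g(lo) < 0:
        return None  # pathological case: f(t) = t has no solution
    hi = max(lo, 1.0)
    while g(hi) >= 0:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if g(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For a progression whose estimates grow faster than time itself, for example estimates of 2 * t + 10, the linear fit has slope greater than one and `linear_estimate` returns `None`, which corresponds to the pathological negative solution mentioned above.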

Space  and  Time  Complexity 

The space and time complexity of fitting the empty function is O(1). The space
complexity of fitting the other functions is linear in the length of the sequence being
fitted. As for the time complexity, the Marquardt-Levenberg algorithm uses a hill
climbing technique. A single iteration of the algorithm is linear in the length of the
sequence being fitted. The number of iterations is potentially unbounded and depends
on the desired precision. Even with a bounded number of iterations per fit, fitting a
function every time a new intermediate estimate is computed results in overall time
complexity that is quadratic in the length of the sequence being fitted. For long
sequences of intermediate estimates, this overhead can be reduced by
employing reservoir sampling [154], which maintains a constant-size set of random
samples of the sequence of intermediate estimates.
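
As an illustration of how a constant-size sample of the progression could be maintained, the following sketch implements the classic reservoir sampling scheme (often called Algorithm R). The class name, its fields, and the choice to store (time, estimate) pairs are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple


class EstimateReservoir:
    """Keeps a uniform random sample of at most `capacity` intermediate estimates."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.samples: List[Tuple[float, float]] = []  # (time, estimate) pairs
        self.seen = 0  # number of estimates observed so far
        self._rng = random.Random(seed)

    def add(self, time: float, estimate: float) -> None:
        """Observe one intermediate estimate in O(1) time."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append((time, estimate))
            return
        # Replace a stored sample with probability capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.samples[j] = (time, estimate)
```

Each fit then operates on at most `capacity` points, so the cost of a single iteration of the fitting algorithm no longer grows with the length of the exploration.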

4.2.4  Resource  Allocation  Policies 

A  test  suite  of  a  large-scale  system  under  development  is  expected  to  consist  of  many 
systematic  tests  of  varied  complexity.  It  is  not  unreasonable  to  expect  that  the  resources 
available  for  running  these  tests  are  not  always  sufficient  to  complete  all  tests  in  the 
time  allotted  for  testing.  This  subsection  describes  how  to  use  test  length  estimation  to 
map  testing  objectives  to  effective  policies  for  allocation  of  scarce  testing  resources. 

All  policies  maintain  a  priority  queue  of  systematic  tests  used  to  identify  which 
systematic  test  to  advance  next.  After  a  systematic  test  is  identified,  a  new  branch  of  its 
execution  tree  is  explored,  and  its  estimate  and  priority  queue  position  are  updated. 
This  process  is  repeated  for  as  long  as  there  are  unfinished  systematic  tests  and  the  time 
allotted  for  testing  has  not  expired.  Different  testing  objectives  are  distinguished  by  the 
function  used  for  ordering  the  elements  of  the  priority  queue.  In  particular,  this  chapter 
considers  two  different  testing  objectives: 

1.  Maximize  the  number  of  completed  tests :  This  objective  is  motivated  by  the  guarantee 
realized  upon  completion  of  a  systematic  test.  For  this  objective,  elements  of 
the  priority  queue  are  ordered  by  their  estimated  time  to  completion,  which  is 
computed  by  subtracting  the  elapsed  time  from  the  test  length  estimate.  In  other 
words,  the  resource  allocation  follows  the  "shortest  remaining  time  first"  policy. 

2.  Achieve  even  coverage  across  tests:  This  objective  is  motivated  by  allocating  testing 
resources  proportionally  to  the  length  of  each  systematic  test.  For  this  objective, 
elements  of  the  priority  queue  are  ordered  by  their  estimated  progress,  which  is 
computed  by  dividing  the  elapsed  time  by  the  test  length  estimate.  In  other  words, 
the  resource  allocation  follows  the  "smallest  coverage  first"  policy. 


Space  and  Time  Complexity 

The space complexity of maintaining a priority queue is O(k), where k is the number of
systematic tests. The time complexity of identifying the top element of a priority queue
is O(1) and the time complexity of updating the value of the top element is O(log k).
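
The allocation loop and the two ordering functions described in this subsection can be sketched with a binary heap. The sketch is illustrative only: the `Exploration` record, the `advance` callback (which stands in for exploring one branch and returns the time spent, the updated estimate, and whether the test finished), and the `run_policy` function are hypothetical names.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Exploration:
    name: str
    estimate: float       # current test length estimate
    elapsed: float = 0.0  # time spent exploring the test so far


def remaining_time(e: Exploration) -> float:
    """Ordering for maximizing completed tests: shortest remaining time first."""
    return e.estimate - e.elapsed


def progress(e: Exploration) -> float:
    """Ordering for even coverage: smallest estimated coverage first."""
    return e.elapsed / e.estimate


Advance = Callable[[Exploration], Tuple[float, float, bool]]


def run_policy(tests: List[Exploration], key: Callable[[Exploration], float],
               advance: Advance, budget: float) -> List[str]:
    """Advance tests in priority order until all finish or the budget expires."""
    heap = [(key(e), i) for i, e in enumerate(tests)]
    heapq.heapify(heap)
    spent = 0.0
    order: List[str] = []
    while heap and spent < budget:
        _, i = heap[0]  # identifying the top element is O(1)
        test = tests[i]
        dt, new_estimate, finished = advance(test)  # explore one branch
        spent += dt
        test.elapsed += dt
        test.estimate = new_estimate
        order.append(test.name)
        if finished:
            heapq.heappop(heap)
        else:
            heapq.heapreplace(heap, (key(test), i))  # O(log k) update
    return order
```

Switching between the two testing objectives only requires passing `remaining_time` or `progress` as the `key` argument; the rest of the loop is shared.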


4.3  Evaluation 

The  goal  of  this  section  is  to  evaluate  1)  the  accuracy  of  the  estimation  techniques  and 
2) the efficiency of resource allocation mechanisms described in Section 4.2.

To this end, this section uses a set of 10 exploration traces recently released by
Google  [140].  The  evaluation  uses  a  trace  simulator  that  reads  these  traces,  simulates 
the  exploration  of  the  execution  tree,  and  computes  an  intermediate  estimate  every 
time  a  branch  of  the  execution  tree  is  explored.  Additionally,  the  evaluation  uses  an 
implementation  of  the  Marquardt-Levenberg  algorithm  that  inputs  a  progression  of 
intermediate  estimates  and  a  function  template  and  computes  what  function  coefficients 
to  use  to  fit  the  function  template  with  the  progression. 


4.3.1  Exploration  Traces 

The  set  of  exploration  traces  used  for  our  evaluation  is  summarized  in  Table  4.1.  The 
table  identifies  the  name  of  a  test,  the  number  of  nodes  of  its  execution  tree,  the  number 
of  branches  of  its  execution  tree,  and  the  total  time  used  at  Google  for  the  exploration. 
The  unit  of  time  is  abstract  as  the  timing  of  the  exploration  traces  has  been  scaled  by  a 
magic  constant  during  the  trace  anonymization  process  [140]. 


Test Name        # Nodes   # Branches       Time
Resource(2)          110            8       2.42
Resource(3)        4,914          279      86.15
Resource(4)      248,408       12,054   4,438.54
Scheduling(6)     29,578          720     250.80
Scheduling(7)    237,528        5,040   1,956.32
Scheduling(8)  2,142,164       40,320  19,868.90
Store(3,3,7)      20,577          924     392.78
Store(3,3,8)      88,386        3,790   1,715.49
Store(3,3,9)     230,747        9,230   2,613.85
TLP            4,201,044       27,200  24,197.60

Table 4.1: Test Statistics

The Resource(x) tests are representative of a class of tests that evaluate interactions
of x different users that acquire and release resources from a pool of x resources. The
Scheduling(x) tests are representative of a class of tests that evaluate handling of x
concurrent scheduling requests. The Store(x,y,z) tests are representative of a class
of tests that evaluate interactions of x users of a distributed key-value store with y
front-end nodes and z back-end nodes. Finally, the TLP test is representative of a class
of tests that perform scheduling work.


4.3.2  Accuracy  Evaluation 

To  evaluate  estimation  accuracy,  an  exploration  is  advanced  for  some  time,  generating 
intermediate  estimates  using  different  combinations  of  strategies  and  estimators.  The 
intermediate  estimate  progression  is  then  fitted  by  the  various  fit  functions  and  the 
solutions of the f(t) = t equations are compared to the known correct value.

The experiment collected progressions of intermediate estimates computed during
simulation of the exploration traces Resource(4), Scheduling(8), Store(3,3,9), and
TLP,  which  are  representative  of  the  full  set.  For  each  test,  the  complete  progression  of 
intermediate  estimates  was  collected  for  all  six  possible  combinations  of  the  oracle,  lazy, 
and  eager  strategies  and  the  weighted  backtrack  and  recursive  estimators.  For  each 
progression,  the  empty,  constant,  linear,  and  logarithmic  fits  were  then  computed  using 
the  initial  1%,  5%,  and  25%  of  the  progression. 

Figure  4.3  depicts  the  results  of  the  experiment.  Given  the  correct  value  correct,  the 
accuracy  of  the  estimate  estimate  is  computed  as  follows: 


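\[
\mathit{accuracy}(\mathit{correct}, \mathit{estimate}) =
\begin{cases}
100 \cdot (\mathit{correct} / \mathit{estimate})\,\% & \text{if } \mathit{estimate} > \mathit{correct} \\
100 \cdot (\mathit{estimate} / \mathit{correct})\,\% & \text{otherwise}
\end{cases}
\]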


For  example,  if  the  correct  value  is  100,  then  the  accuracy  of  the  estimate  50  is  50%, 
while  the  accuracy  of  the  estimate  400  is  25%.  In  other  words,  the  above  measure 
of  accuracy  does  not  distinguish  between  under-estimation  and  over-estimation,  a 
provision  used  to  simplify  the  presentation. 
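
The accuracy measure can be written as a small function; the following is a direct transcription of the definition, with the function name and the input validation added for the example.

```python
def accuracy(correct: float, estimate: float) -> float:
    """Accuracy in percent; symmetric in under- and over-estimation."""
    if correct <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    if estimate > correct:
        return 100.0 * correct / estimate
    return 100.0 * estimate / correct
```

With a correct value of 100, `accuracy(100, 50)` evaluates to 50.0 and `accuracy(100, 400)` evaluates to 25.0, matching the worked example above.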

Figure  4.3  consists  of  three  bar  graphs,  one  for  each  percentage  at  which  the  fit  was 
computed.  Each  of  the  bar  graphs  depicts  the  accuracy  achieved  for  the  96  different 
combinations  of  a  test,  a  strategy,  an  estimator,  and  a  fit,  with  the  results  clustered  by  fit 
and  test.  Note  that  the  vertical  axis  depicting  the  accuracy  is  in  logarithmic  scale.  In 
some  cases,  the  application  of  a  fit  did  not  generate  a  positive  solution  for  the  estimate 
and  in  such  cases  the  bar  is  missing. 


Strategy  Comparison 

The  evaluation  indicates  that,  unlike  the  eager  strategy,  the  lazy  strategy  is  consistent 
with  the  oracle  strategy.  In  other  words,  the  lazy  and  the  oracle  strategies  tend  to  agree 
on  which  children  will  be  explored  in  the  future;  in  fact,  in  the  case  of  the  Scheduling 
tests  and  the  TLP  test,  the  lazy  strategy  and  the  oracle  strategy  are  indistinguishable. 
At the same time the evaluation indicates that the oracle strategy does not always
produce  the  most  accurate  results.  Surprisingly,  the  eager  strategy  occasionally  produces 
the  most  accurate  results.  The  poor  performance  of  estimation  techniques  based  on  the 
oracle  strategy  is  attributed  to  under-estimation  introduced  by  the  estimators  and  fits, 
which  the  eager  strategy  compensates  for  with  its  over-estimation. 


Estimator  Comparison 

The evaluation does not indicate that either of the two estimators consistently
outperforms the other one. The weighted backtrack estimator, however, produces estimates
that are, except for two cases, within an order of magnitude of the correct value. This
cannot be said about the recursive estimator. This makes the weighted backtrack
estimator a more robust choice.

Figure 4.3: Accuracy of Estimation Techniques (legend: RE + Hindsight, WBE + Hindsight, RE + Lazy, WBE + Lazy, RE + Eager, WBE + Eager)


Fit  Comparison 


The  linear  fit  often  fails  to  generate  a  positive  solution,  which  makes  it  unreliable. 
Comparing  the  empty  fit  to  the  constant  fit,  the  empty  fit  produces  equivalent  or  better 
results  in  most  of  the  cases,  which  confirms  our  intuition  that  the  intermediate  estimates 
grow  more  accurate  with  time.  The  logarithmic  fit  generates  a  solution  in  all  but  two 
cases, and compared to the empty fit, produces equivalent or better results.

Technique
E+W+E      62.11%   75.19%   93.46%   55.87%   68.97%
L+R+E      16.95%   86.21%   16.45%    9.24%   16.69%
L+R+L      48.78%      N/A   75.19%   27.70%   42.92%
L+W+L      14.03%   89.29%   52.91%   13.18%   22.57%

Table 4.2: Accuracy of Best Techniques after 1%

Technique
E+W+E      45.66%   80.65%   91.74%   84.03%   69.93%
L+R+E      30.68%   92.59%   94.34%   15.22%   33.44%
L+R+L      63.69%   87.72%   40.82%   23.42%   42.37%
L+W+L      51.02%   94.34%   41.15%   23.58%   41.32%

Table 4.3: Accuracy of Best Techniques after 5%

Technique
E+W+E      42.19%   85.47%   59.17%   74.07%   60.61%
L+R+E      66.67%   99.01%   97.09%   47.39%   70.92%
L+R+L      93.46%   90.91%   91.74%   85.47%   90.09%
L+W+L      85.47%   95.24%   60.61%   62.50%   72.99%

Table 4.4: Accuracy of Best Techniques after 25%


Overall  Comparison 

To analyze the overall accuracy, Tables 4.2, 4.3, and 4.4 take a closer look at the
best-performing techniques, reporting the accuracy after 1%, 5%, and 25% of the exploration
respectively. The acronyms in the Technique column have the following meaning:
e+w+e stands for the eager strategy, the weighted backtrack estimator, and the empty
fit, l+r+e stands for the lazy strategy, the recursive estimator, and the empty fit, l+r+l
stands for the lazy strategy, the recursive estimator, and the logarithmic fit, and l+w+l
stands for the lazy strategy, the weighted backtrack estimator, and the logarithmic fit.

Interestingly, the accuracy of the e+w+e technique does not improve over time,
while the accuracy of the other techniques does. To understand this phenomenon better,
Figure 4.4 depicts the progression of intermediate estimates over time for selected
tests and best-performing techniques. Each graph also includes a horizontal line that
identifies the correct value.

Figure 4.4: Evolution of Intermediate Estimates Over Time

Figure 4.4 reveals that the intermediate estimates based on the eager strategy can
change quickly over time, which suggests that the accuracy of estimation techniques
based on the combination of the eager strategy and the empty fit can be sensitive to the
time at which their estimates are computed. Further, note that the eager strategy typically
results in over-estimation, while the lazy strategy typically results in under-estimation.

In  summary,  many  techniques  examined  in  this  chapter  consistently  achieve  average 
accuracy  above  60%  after  exploring  as  little  as  1%  of  the  state  space.  At  first  glance 
the  e+w+e  technique  seems  to  be  the  best  but  further  investigation  suggests  the  l+w+l 
technique  is  a  more  robust  choice.  This  conclusion  is  based  on  two  facts:  1)  the  accuracy 
of  fitting  a  logarithm  to  a  progression  of  intermediate  estimates  generated  by  the 
combination  of  the  lazy  strategy  and  weighted  backtrack  estimator  and  2)  the  resilience 
of  the  logarithmic  fit  to  spikes  and  drops  in  the  progression  of  intermediate  estimates. 

4.3.3  Efficiency  Evaluation 

This subsection evaluates the potential of test length estimation to help implement
allocation policies that target the two testing objectives described in Section 4.2: maximizing
the number of completed tests and achieving even coverage.

Maximizing  #  of  Completed  Tests 

To  evaluate  how  well  test  length  estimation  techniques  help  in  maximizing  the  number 
of  completed  tests,  a  resource  allocation  simulator  was  created.  This  simulator  maintains 
a  priority  queue  that  tracks  the  remaining  time  estimated  for  each  exploration  trace  and 
a  scheduler  that  uses  the  previously  described  trace  simulator  to  advance  exploration  in 
the  order  dictated  by  the  priority  queue. 

The  evaluation  examines  all  combinations  of  the  weighted  backtrack  and  recursive 
estimators  and  the  oracle,  eager  and  lazy  strategies.  However,  given  the  frequency  of 
estimate  computation  in  this  experiment,  only  the  empty  fit  is  considered  in  order  to 
limit  the  simulation  duration. 

For  the  sake  of  comparison,  two  additional  allocation  policies  are  examined:  1)  a 
round-robin  policy,  which  selects  the  next  test  to  advance  using  round-robin,  representing 
a  baseline  approach  commonly  used  in  practice,  and  2)  an  optimal  policy,  which  knows 
the true length of each test and executes tests from the shortest to the longest.

Lastly, the allocation simulator was provided with unlimited time and, for each
estimation technique, recorded the time at which each of the 10 exploration traces was
completed. In other words, instead of fixing a time budget in advance, the experiment
records data from which the results for an arbitrary fixed time budget can be derived.
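
To make the last point concrete, the following sketch shows how recorded completion times translate into the number of tests completed under any given budget. The function name is hypothetical and the completion times used in the accompanying example are illustrative values, not measurements.

```python
from bisect import bisect_right
from typing import Iterable


def completed_within(completion_times: Iterable[float], budget: float) -> int:
    """Number of tests whose recorded completion time fits within the budget."""
    ordered = sorted(completion_times)
    return bisect_right(ordered, budget)
```

For instance, with recorded completion times of 1, 5, and 9 time units, a budget of 5 units yields two completed tests, while a budget of 0.5 units yields none.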

Figure  4.5  presents  a  bar  graph,  which  for  each  allocation  policy  plots  the  time 
needed  to  complete  a  certain  number  of  tests.  The  horizontal  axis  shows  the  number 
of completed tests, while the vertical axis shows time units on a logarithmic scale.

Figure 4.5: Maximizing # of Completed Tests

Completed  RE-EAGER  RE-LAZY  WBE-EAGER  WBE-LAZY  OPTIMAL
1              0.29     0.29       0.29      0.29     0.08
2              0.38     0.09       0.09      0.09     0.08
3              0.30     0.20       0.13      0.20     0.13
4              8.08     0.24       0.22      0.25     0.22
5              2.58     0.27       0.23      0.31     0.21
6              2.03     0.49       1.82      0.50     0.30
7              1.47     0.45       2.00      0.46     0.30
8              1.24     0.46       1.79      0.48     0.40
9              1.11     0.73       1.10      0.73     0.64
10             1.00     1.00       1.00      1.00     1.00
Mean           1.85     0.42       0.87      0.43     0.34

Table 4.5: Performance of Allocation Algorithms


Note that the measurement is oblivious to the order in which the tests finish and only
compares the times required by different algorithms to complete a certain number of tests.

The experiment results indicate that the policies that incorporate the eager strategy
tend to perform poorly and, in some cases, need more time to complete a certain number
of tests than the round-robin policy. In contrast to that, the policies that incorporate the
oracle and the lazy strategies match the performance of the optimal policy, requiring
a fraction of the time required by the round-robin policy to complete the first few tests.
Further, the experiment results indicate that, in the context of this experiment, the choice
of the estimator is not significant.

Table  4.5  reports  the  fractions  of  the  time  required  by  the  best  performing  algorithms 
with  respect  to  the  baseline  round-robin  policy.  The  rows  report  these  fractions  for 
each possible number of completed tests and the last row reports their arithmetic mean.
These numbers indicate that the policies based on a combination of the lazy strategy,
either of the two estimators, and the empty fit, reduce the time needed by the baseline
round-robin policy to complete a certain number of tests by a factor of 2.38 on average,
coming close to the optimal factor of 2.94.
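
The reported factors appear to be the reciprocals of the rounded means in the last row of Table 4.5; the following sketch reproduces that arithmetic under this assumption. The column values are copied from the table, while the function names are hypothetical.

```python
from typing import List

# Fractions of the round-robin time, copied from Table 4.5 (rows 1 through 10).
FRACTIONS = {
    "RE-LAZY": [0.29, 0.09, 0.20, 0.24, 0.27, 0.49, 0.45, 0.46, 0.73, 1.00],
    "OPTIMAL": [0.08, 0.08, 0.13, 0.22, 0.21, 0.30, 0.30, 0.40, 0.64, 1.00],
}


def mean_fraction(values: List[float]) -> float:
    """Arithmetic mean of the per-row fractions, rounded as in the table."""
    return round(sum(values) / len(values), 2)


def reduction_factor(values: List[float]) -> float:
    """Reciprocal of the rounded mean fraction (assumed derivation)."""
    return round(1.0 / mean_fraction(values), 2)
```

Under this reading, the RE-LAZY column has a mean of 0.42 and a factor of 2.38, and the OPTIMAL column has a mean of 0.34 and a factor of 2.94.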

Achieving  Even  Coverage 

To  evaluate  how  well  test  length  estimation  techniques  help  in  achieving  even  coverage 
across  the  test  suite,  the  resource  allocation  simulator  from  the  previous  subsection  was 
modified  to  order  elements  of  its  priority  queue  by  the  estimated  coverage  instead  of 
the  estimated  remaining  time. 

Similarly to the previous experiment, all combinations of the strategies, the estimators,
and the empty fit were examined and a baseline was modeled using a policy based
on the round-robin order. In contrast to the previous experiment, a scarcity of testing
resources was simulated by setting the time budget to 5,000 time units, representing
10% of the time needed to fully explore all tests.

For each test, the experiment measured the coverage error computed as the relative
difference between the realized and even coverage. Formally, let $\mathit{time}_{\mathit{sum}}$ be the sum of the
actual runtimes of all tests in a test suite and $\mathit{time}_{\mathit{budget}} < \mathit{time}_{\mathit{sum}}$ be the time allotted
for testing, then even coverage equals:

\[
\mathit{even} = \frac{\mathit{time}_{\mathit{budget}}}{\mathit{time}_{\mathit{sum}}}
\]

Further, let $\mathit{time}_{\mathit{full}}(t)$ be the time needed to fully explore the test t and $\mathit{time}_{\mathit{explored}}(t)$
be the time spent exploring the test t, then the realized coverage of the test t equals:

\[
\mathit{realized}(t) = \frac{\mathit{time}_{\mathit{explored}}(t)}{\mathit{time}_{\mathit{full}}(t)}
\]

and the coverage error of the realized coverage realized(t) with respect to the even
coverage even, denoted error(realized(t), even), is defined as:

\[
\mathit{error}(\mathit{realized}(t), \mathit{even}) =
\begin{cases}
\mathit{even} / \mathit{realized}(t) & \text{if } \mathit{even} > \mathit{realized}(t) \\
\mathit{realized}(t) / \mathit{even} & \text{otherwise}
\end{cases}
\]

For  example,  if  the  target  even  coverage  is  10%,  then  the  coverage  error  of  the 
realized  coverage  5%  is  2,  while  the  coverage  error  of  the  realized  coverage  40%  is 
4.  In  other  words,  the  above  measure  of  coverage  error  does  not  distinguish  between 
falling  short  of  and  exceeding  the  target  even  coverage,  a  provision  used  to  simplify  the 
presentation. 
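
The coverage error can be sketched in a few lines. The function names are hypothetical, and the error is computed as the larger of the two coverages divided by the smaller one, so that it agrees with the worked example above. The trace times are copied from Table 4.1.

```python
# Exploration times of the 10 traces, copied from Table 4.1.
TRACE_TIMES = [2.42, 86.15, 4438.54, 250.80, 1956.32,
               19868.90, 392.78, 1715.49, 2613.85, 24197.60]


def even_coverage(time_budget: float, time_sum: float) -> float:
    """Fraction of every test that would be covered under even coverage."""
    return time_budget / time_sum


def realized_coverage(time_explored: float, time_full: float) -> float:
    """Fraction of a single test that was actually covered."""
    return time_explored / time_full


def coverage_error(realized: float, even: float) -> float:
    """Ratio of the larger coverage to the smaller one; 1.0 means on target."""
    if even > realized:
        return even / realized
    return realized / even


# With a budget of 5,000 time units, even coverage is roughly 9%, so a test
# that is explored completely has a coverage error of roughly 11.
EVEN = even_coverage(5000.0, sum(TRACE_TIMES))
FULLY_EXPLORED_ERROR = coverage_error(realized_coverage(2.42, 2.42), EVEN)
```

A policy that completes the shortest tests regardless of their length therefore incurs the largest errors on exactly those tests, since their realized coverage of 100% far exceeds the even target.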

Figure  4.6  depicts  a  bar  graph,  which  for  each  test  contains  a  cluster  of  the  coverage 
errors  achieved  by  each  allocation  policy.  The  horizontal  axis  identifies  the  test  cluster, 
while the vertical axis plots the coverage error.

Figure 4.6: Achieving Even Coverage

Test           ROUND-ROBIN  RE-EAGER  RE-LAZY  WBE-EAGER  WBE-LAZY
Resource(2)          11.11      1.36     1.36       1.36      1.36
Resource(3)          11.11      2.77     3.26       2.77      4.05
Resource(4)           1.31      3.17     1.51       1.80      2.87
Scheduling(6)        11.11      1.35     1.96       1.38      2.09
Scheduling(7)         3.19      1.13     1.92       1.23      2.02
Scheduling(8)         2.59      1.12     2.00       1.33      2.07
Store(3,3,7)         11.11      1.61     1.59       1.22      1.20
Store(3,3,8)          4.12      2.01     2.00       1.35      1.15
Store(3,3,9)          2.35      1.78     1.75       1.01      1.22
TLP                   1.67      1.71     8.15       1.71      7.12
Mean                  5.97      1.80     2.55       1.52      2.51

Table 4.6: Coverage Error of Allocation Algorithms


The  experiment  results  indicate  that  the  policies  that  incorporate  the  oracle  and  lazy 
strategies  tend  to  perform  poorly  and  in  some  cases  achieve  even  higher  coverage  error 
than  the  baseline  policy.  In  contrast  to  that,  the  policies  based  on  the  eager  strategy  often 
produce  the  best  result.  Further,  the  policies  that  incorporate  the  weighted  backtrack 
estimator  tend  to  perform  better  than  those  that  incorporate  the  recursive  estimator. 

Table  4.6  compares  the  coverage  errors  of  the  best-performing  policies  to  the  coverage 
error  achieved  by  the  baseline  round-robin  policy.  The  rows  report  the  coverage  errors 
for  individual  tests  and  the  last  row  reports  their  arithmetic  mean.  The  experiment 
indicates  that  the  best  allocation  policy  is  based  on  a  combination  of  the  eager  strategy, 
the  weighted  backtrack  estimator,  and  the  empty  fit,  and  this  policy  reduces  the  average 
coverage error of the baseline round-robin policy from 5.97 down to 1.52.

4.4 Related Work


State  Space  Estimation 

This  chapter  adapts  work  on  estimating  search  tree  size  by  Kilby  et  al.  [80]  to  the 
problem  of  estimating  tree  exploration  length.  In  their  evaluation,  Kilby  et  al.  present 
the  accuracy  of  their  estimation  techniques  achieved  during  exploration  of  search  trees 
corresponding  to  both  decision  and  optimization  problems. 

Similar  to  our  work,  the  techniques  of  Kilby  et  al.  can  be  described  as  online  and 
passive.  However,  in  contrast  to  search  tree  size  estimation,  our  techniques  have  to 
address  the  dynamic  nature  of  the  DPOR  algorithm.  Nonetheless,  comparing  the 
accuracy of our estimation techniques to the estimation techniques of Kilby et al. on search
trees,  our  techniques  perform  equally  well  on  a  harder  problem. 

Taleghani  and  Atlee  [146]  studied  the  problem  of  state  space  coverage  estimation  for 
explicit-state  model  checking.  Their  solution  is  based  on  Monte  Carlo  techniques  and 
complements  state  space  exploration  with  random  walks  to  estimate  the  ratio  between 
visited  and  unvisited  states.  They  implemented  their  technique  for  the  Java  PathFinder 
(JPF)  [153]  model  checker  and  used  a  collection  of  Java  programs  for  evaluation. 

In  contrast  to  our  work,  the  technique  of  Taleghani  and  Atlee  is  limited  to  stateful 
approaches  and  can  be  described  as  offline  and  active.  More  precisely,  the  estimate 
is  computed  at  the  end  of  the  exploration  and  the  computation  relies  on  a  particular 
exploration  strategy.  Although  Taleghani  and  Atlee  do  not  explicitly  mention  whether 
their  experiments  were  carried  out  in  the  context  of  state  space  reduction,  since  JPF 
supports  it,  we  assume  they  were.  If  that  is  the  case,  our  technique  performs  equally 
well on a similar problem but does not rely on a specific exploration algorithm.


Resource  Allocation 

Dynamic allocation of resources to a collection of independent tasks is both a
well-studied theoretical problem [150] and a practical problem addressed by a variety of
systems, ranging from batch schedulers such as the Maui scheduler [77] to platforms for
sharing resources in a data center [69, 129].

While in practice [69, 77, 129] tasks are usually running concurrently on a cluster,
this chapter uses a simple model that schedules tasks sequentially. This simplification
is  justified  by  the  unique  nature  of  the  execution  tree  exploration,  which  typically 
consists  of  many  executions  of  the  same  test.  Recording  progress  of  each  exploration 
using  an  execution  tree  enables  fine-grained  interleaving  of  concurrent  explorations. 
In  addition,  different  executions  of  the  same  test  can  be  explored  in  parallel,  enabling 
linear  speed-up  [139]  (cf.  Chapter  5).  Representing  a  cluster  of  machines  as  a  sequence 
of  machine  cycles  is  a  reasonable  abstraction  of  a  large  set  of  small  independent  tasks. 

Another unique aspect of resource allocation among systematic tests is how value
is measured. In general, the value of a task accrues when it finishes, which is reflected
in scheduling objectives that minimize average task latency [77] or maximize task
throughput [150]. In contrast, for systematic tests, value could accrue as state space
is being covered, which is reflected in scheduling policies that allocate resources
proportionally to the estimated test length.

Interestingly, the problem of deciding which systematic test to advance next is
similar to the multi-armed bandit problem [151]. In short, a multi-armed bandit
problem for a gambler is to decide which arm of an n-armed slot machine to pull to maximize
his  total  reward  in  a  series  of  trials.  To  bridge  the  research  on  multi-armed  bandits  with 
our  work,  one  needs  to  define  a  reward  function  for  systematic  tests.  The  search  for 
such  a  function  is  an  interesting  avenue  for  future  work. 


4.5  Conclusions 

This  chapter  presents  a  solution  to  the  problem  of  allocating  scarce  resources  to  a 
collection  of  systematic  tests.  The  solution  comes  in  the  form  of  different  allocation 
policies  based  on  techniques  for  estimating  the  length  of  a  systematic  test. 

In the context of this chapter, an estimation technique consists of three logical
components: a strategy, an estimator, and a fit. This chapter has considered three
strategies:  eager,  oracle  (infeasible  in  practice),  and  lazy;  two  estimators:  weighted 
backtrack  and  recursive;  and  four  fits:  empty,  constant,  linear,  and  logarithmic. 

The  evaluation  of  the  different  combinations  of  these  components  reveals  that  the 
overall  average  accuracy  of  our  best  estimation  techniques  is  upwards  of  60%  after 
exploring as little as 1% of the state space. To further improve the estimation accuracy,
future work could take advantage of common structural properties of execution trees,
such as the power law of random network graphs [10].

Besides evaluating the accuracy of estimation techniques, this chapter also investigated
the ability of these techniques to map a testing objective to a policy for allocating
scarce resources to a collection of systematic tests. Two testing objectives were considered:
1) maximizing the number of completed tests and 2) achieving even coverage
across different tests. For each of these objectives an allocation policy parametrized
by an estimation technique was designed and its performance evaluated against the
performance of a baseline policy.

The  experimental  evaluation  of  these  allocation  policies  revealed  that  while  the  lazy 
strategy  outperformed  the  eager  strategy  at  meeting  the  first  testing  objective,  the  eager 
strategy  outperformed  the  lazy  strategy  at  meeting  the  second  testing  objective.  This 
indicates  that  the  lazy  strategy  achieves  better  accuracy  when  estimating  the  absolute 
length  of  an  individual  test,  while  the  eager  strategy  achieves  better  accuracy  when 
estimating  the  ratio  between  lengths  of  different  tests. 

Further,  the  evaluation  demonstrated  that,  for  the  testing  objectives  considered  in 
this  chapter,  the  allocation  policies  based  on  test  length  estimation  improve  on  the 
round-robin  policy.  Future  work  along  these  lines  could  experiment  with  other  testing 
objectives, such as maximizing the number of bugs found, which was not considered in this
chapter since the exploration traces from Google do not contain such information.

Chapter 5

Parallel  State  Space  Exploration 


This  chapter  presents  a  new  method  for  distributed  systematic  testing  of  concurrent 
programs,  which  pushes  the  limits  of  systematic  testing  to  an  unprecedented  scale. 
The  approach  presented  here  is  based  on  a  novel  exploration  algorithm  that  1)  enables 
trading space complexity for parallelism, 2) achieves load-balancing through time-slicing,
3) provides fault tolerance, a mandatory aspect of scalability, 4) scales to more
than  a  thousand  parallel  workers,  and  5)  is  guaranteed  to  avoid  redundant  exploration. 

The  rest  of  the  chapter  is  organized  as  follows.  Section  5.1  discusses  previous  work 
on  distributed  systematic  testing.  Section  5.2  presents  a  novel  exploration  algorithm 
and  details  its  use  for  distributed  systematic  testing  at  scale.  Section  5.3  presents  an 
experimental  evaluation.  Section  5.4  discusses  related  work  and  Section  5.5  draws 
conclusions. 


5.1  Background 

This  section  summarizes  distributed  dynamic  partial  order  reduction  (distributed 
DPOR) [51, 167], a state-of-the-art algorithm for distributed systematic testing of
concurrent  programs. 

Distributed  DPOR  targets  concurrent  exploration  of  branches  of  the  execution  tree. 
The  goal  of  distributed  DPOR  is  to  offset  the  combinatorial  explosion  of  possible 
permutations  of  concurrent  events  through  parallel  processing. 

Parallelization  of  execution  tree  exploration  seems  straightforward:  assign  different 
parts  of  the  execution  tree  to  different  workers  and  explore  the  execution  tree  concurrently. 
However,  as  pointed  out  by  Yang  et  al.  [167],  such  a  parallelization  suffers  from  two 
problems.  First,  due  to  the  non-local  nature  in  which  the  DPOR  algorithm  updates  the 
exploration  frontier  (cf.  Section  2.1.4),  different  workers  may  end  up  exploring  identical 
parts  of  the  state  space.  Second,  since  the  sizes  of  the  different  parts  of  the  execution 
tree  are  not  known  in  advance,  load-balancing  is  needed  to  enable  linear  speedup. 

To  address  these  two  problems,  Yang  et  al.  [167]  proposed  two  heuristics.  Their 
first  heuristic  modifies  DPOR's  lazy  addition  of  nodes  to  the  exploration  frontier  [51] 
so that nodes are added to the exploration frontier eagerly instead. As evidenced by
their  experiments,  replacing  lazy  addition  with  eager  addition  mitigates  the  problem 
of  redundant  exploration  of  identical  parts  of  the  execution  tree  by  different  workers. 
Their  second  heuristic  assumes  the  existence  of  a  centralized  load-balancer  that  workers 
can  contact  in  case  they  believe  they  have  too  much  work  on  their  hands  and  would 
like  to  offload  some  of  the  work.  The  centralized  load-balancer  keeps  track  of  which 
workers  are  idle  and  which  workers  are  active  and  facilitates  offloading  of  work  from 
active  to  idle  workers. 


5.2  Methods 

While  scaling  DPOR  to  a  large  cluster  at  Google  [139],  several  shortcomings  of  the 
previous  work  [167]  were  identified.  First,  at  large  scale,  distributed  exploration  must 
be  able  to  cope  with  failures  of  worker  processes  or  machines.  Although  Yang  et 
al.  [167]  suggest  how  fault  tolerance  could  be  implemented,  they  do  not  quantify  how 
their  envisioned  support  for  fault  tolerance  would  affect  scalability.  Second,  although 
the  out-of-band  centralized  load-balancer  of  Yang  et  al.  renders  the  communication 
overhead  negligible,  it  is  not  clear  whether  the  centralized  approach  can  be  used  to 
support  additional  features,  such  as  fault  tolerance  and  state  space  size  estimation, 
without  becoming  a  bottleneck.  Third,  the  load-balancing  of  Yang  et  al.  uses  a  heuristic 
based  on  a  threshold  to  offload  work  from  active  to  idle  workers.  It  is  likely  that  for 
different  programs  and  different  numbers  of  workers,  different  threshold  values  should 
be  used.  However,  Yang  et  al.  provide  no  insight  into  the  problem  of  selecting  a  good 
threshold.  Fourth,  their  DPOR  modification  for  avoiding  redundant  exploration  is  a 
heuristic,  not  a  guarantee. 

This  section  presents  an  alternative  design  for  distributed  DPOR.  The  design  is 
centralized  and  uses  a  single  master  and  n  workers  to  explore  the  execution  tree. 
Despite  its  centralized  nature,  our  experiments  show  that  the  design  scales  to  more  than 
a  thousand  workers.  Unlike  previous  work  [167],  the  design  can  tolerate  worker  faults, 
is  guaranteed  to  avoid  redundant  exploration,  and  is  based  on  a  novel  exploration 
algorithm  that  allows  1)  trading  off  space  complexity  for  parallelism  and  2)  efficient 
load-balancing  through  time-slicing. 

5.2.1  Partitioned  Depth-First  Search 

The  key  advantage  of  using  depth-first  search  for  the  exploration  carried  out  by  the 
DPOR  algorithm  (cf.  Section  2.1.4)  is  its  favorable  space  complexity  [61].  This  fact 
is  the  main  reason  why  the  bottleneck  of  state  space  exploration  in  existing  tools  for 
systematic  testing  of  concurrent  programs  [51,  107,  137,  160]  is  the  CPU  speed  and  not 
the  memory  size.  This  is  hardly  surprising  given  that  the  time  complexity  of  the  DPOR 
algorithm  based  on  depth-first  search  exploration  is  linear  in  the  size  of  the  execution 
tree  and  quadratic  in  its  depth,  while  its  space  complexity  is  linear  in  its  depth. 

To enable parallel processing, Yang et al. [167] depart from the strict depth-first
search  nature  of  state  space  exploration.  Instead,  the  execution  tree  is  explored  using 
a  collection  of  possibly  overlapping  depth-first  searches  and  the  exploration  order  is 
determined  by  a  load-balancing  heuristic,  which  uses  an  ad-hoc  threshold  to  evenly 
distribute  unexplored  subtrees  of  the  execution  tree  across  the  worker  fleet. 

The  design  presented  here  uses  a  novel  exploration  algorithm,  called  n-partitioned 
depth-first  search,  which  relaxes  the  strict  depth-first  search  nature  of  traditional  state 
space  exploration  in  a  controlled  manner  and,  unlike  traditional  depth-first  search,  is 
amenable  to  parallelization. 

The  main  difference  between  depth-first  search  and  n-partitioned  depth-first  search 
is  that  the  exploration  frontier  of  the  new  algorithm  is  partitioned  into  up  to  n  frontier 
fragments  and  the  new  algorithm  explores  each  fragment  using  the  traditional  depth-first 
search,  interleaving  exploration  of  different  fragments. 

For  the  sake  of  the  presentation,  a  sequential  version  of  the  DPOR  algorithm  based 
on  the  n-partitioned  depth-first  search  is  first  presented  as  Algorithm  5.  The  algorithm 
maintains  an  exploration  frontier,  represented  as  a  set  of  up  to  n  stacks  of  sets  of 
nodes.  The  elements  of  the  exploration  frontier  are  referred  to  as  fragments  and  together 
they  form  a  partitioning  of  the  exploration  frontier.  The  execution  tree  is  explored 
by  interleaving  depth-first  search  exploration  of  frontier  fragments.  The  algorithm 
implements  this  idea  by  repeating  two  steps  -  Partition  and  Explore  -  until  the 
execution  tree  is  explored. 

During the Partition step, the current frontier is inspected to see whether existing
frontier fragments should be and can be further partitioned. A new frontier fragment
should be created if there are fewer than n frontier fragments. A new frontier fragment
can be created if there exists a frontier fragment with at least two nodes.

The  Explore  step  is  given  one  of  the  frontier  fragments  and  uses  depth-first  search 
to  explore  the  next  edge  of  the  subtree  induced  by  the  selected  frontier  fragment  (the 
subtree  that  contains  all  ancestors  and  descendants  of  the  nodes  contained  in  the  selected 
frontier fragment). The UpdateFrontier(frontier, fragment, node) function operates in a
similar fashion to the UpdateFrontier(frontier, node) function described in Chapter 2.
The  main  distinction  is  that  after  the  new  version  of  the  function  identifies  which  nodes 
are  to  be  added  to  the  exploration  frontier  using  the  DPOR  algorithm,  these  nodes  are 
added  to  the  current  frontier  fragment  only  if  they  are  not  already  present  in  some 
other  fragment.  This  way,  the  set  of  sets  of  nodes  contained  in  each  fragment  remains 
a  partitioning  of  the  exploration  frontier  -  an  invariant  maintained  throughout  our 
exploration  that  helps  our  design  to  avoid  redundant  exploration. 
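
The interleaving of Partition and Explore steps can be illustrated with a minimal Python sketch. It is not the thesis implementation: it explores an explicit toy tree (the hypothetical `TREE` dictionary), omits DPOR entirely by pushing all children of a node instead of calling UpdateFrontier, and picks fragments round-robin, which is one possible reading of "arbitrary". The names `size`, `partition`, `explore`, and `n_partitioned_dfs` are assumptions made for illustration.

```python
from collections import deque


def size(fragment):
    """Number of frontier nodes held by a fragment (a stack of sets)."""
    return sum(len(level) for level in fragment)


def partition(frontier, n):
    """Split fragments until there are n of them or none can be split."""
    for fragment in list(frontier):
        while len(frontier) < n and size(fragment) > 1:
            # Take a node from the lowest non-empty level, so the top of
            # the stack never becomes empty as a result of the split.
            level = next(lvl for lvl in fragment if lvl)
            node = level.pop()
            frontier.append([{node}])


def explore(node, fragment, children, visited):
    """One depth-first step: visit `node` and push its children (no DPOR)."""
    fragment[-1].discard(node)
    visited.append(node)
    kids = children.get(node, ())
    if kids:
        fragment.append(set(kids))
    while fragment and not fragment[-1]:
        fragment.pop()  # pop exhausted levels from the stack


def n_partitioned_dfs(root, children, n):
    """Visit every node reachable from `root` using up to n fragments."""
    frontier = deque([[{root}]])
    visited = []
    while frontier:
        partition(frontier, n)
        fragment = frontier.popleft()  # round-robin choice of fragment
        node = next(iter(fragment[-1]))
        explore(node, fragment, children, visited)
        if fragment:
            frontier.append(fragment)
    return visited


# Hypothetical toy execution tree: node -> list of children.
TREE = {0: [1, 2], 1: [3, 4], 2: [5, 6], 4: [7]}
```

With n = 1 the sketch degenerates to ordinary depth-first search. For larger n the visiting order changes, but because the fragments partition the frontier, each node is still visited exactly once.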

5.2.2  Parallelization 

This  subsection  describes  how  to  efficiently  parallelize  Algorithm  5.  First,  observe  that 
the  presence  or  absence  of  the  Partition  step  in  the  body  of  the  main  loop  of  the 
algorithm  has  no  effect  on  the  correctness  of  the  algorithm.  This  allows  us  to  sequence 
several  Explore  steps  together,  which  hints  at  possible  distribution  of  the  exploration. 

Algorithm 5 SequentialScalableDynamicPartialOrderReduction(n, root)
Require: n is a positive integer and root is the initial program state.
Ensure: All program states reachable from root are explored.

procedure Partition(frontier, n)
    if Size(frontier) = n then
        return
    end if
    for all fragment ∈ frontier do
        while Size(frontier) < n and Size(fragment) > 1 do
            node ← an arbitrary element of a set contained in fragment
            remove node from fragment
            new-fragment ← a new frontier fragment for node
            Insert(new-fragment, frontier)
        end while
    end for
end procedure

procedure Explore(node, fragment, frontier)
    remove node from Top(fragment)
    UpdateFrontier(frontier, fragment, node)
    if Children(node) not empty then
        child ← arbitrary element of Children(node)
        Push({child}, fragment)
        navigate execution to child
    end if
    pop empty sets from the fragment stack
end procedure

frontier ← NewSet
Insert(Push({root}, NewStack), frontier)
while Size(frontier) > 0 do
    Partition(frontier, n)
    fragment ← an arbitrary element of frontier
    node ← an arbitrary element of Top(fragment)
    Explore(node, fragment, frontier)
    if Size(fragment) = 0 then
        Remove(fragment, frontier)
    end if
end while


Namely, one could spawn concurrent workers and use them to carry out sequences
of Explore steps over different frontier fragments. However, a straightforward
implementation of this idea would require synchronization when concurrent workers
access and update the exploration frontier, which is shared by all workers. The trick to
overcome this obstacle to efficient parallelization is to give each worker a private copy
of the execution tree. As pointed out by Yang et al. [167], such a copy can be concisely
represented using the depth-first search stack of the frontier fragment to be explored.

Algorithm 6 DistributedScalableDynamicPartialOrderReduction(n, budget, root)
Require: n is a positive integer, budget is a time budget for worker exploration, and root
is the initial program state.
Ensure: All program states reachable from root are explored.

procedure ExploreCallback(old-fragment, new-fragment, frontier)
    replace old-fragment of frontier with new-fragment
    mark new-fragment as unassigned
    signal main exploration loop
end procedure

procedure ExploreLoop(fragment, budget, frontier)
    start-time ← GetTime
    repeat
        node ← an arbitrary element of Top(fragment)
        Explore(node, fragment, frontier)
    until (GetTime - start-time > budget) or (Size(fragment) = 0)
end procedure

frontier ← NewSet
Insert(Push({root}, NewStack), frontier)
while Size(frontier) > 0 do
    Partition(frontier, n)
    while exists an idle worker and an unassigned frontier fragment do
        fragment ← an arbitrary unassigned element of frontier
        Spawn(ExploreLoop, fragment, budget, ExploreCallback)
    end while
    wait until signaled by ExploreCallback
end while

A  worker  can  then  repeatedly  invoke  the  Explore  function  over  (a  copy  of)  the 
assigned  frontier  fragment.  Once  the  worker  either  completes  the  exploration  of  the 
assigned  frontier  fragment  or  it  exceeds  the  time  allotted  for  its  exploration,  it  reports 
back  with  the  results  of  the  exploration.  The  exploration  progress  can  be  concisely 
represented  using  the  original  and  the  final  state  of  the  depth-first  search  stack  of  the 
assigned  frontier  fragment. 

Algorithm  6  presents  a  high-level  approximation  of  the  actual  implementation  of  our 
design  for  scalable  DPOR.  The  implementation  operates  with  the  concept  of  fragment 
assignment.  When  a  frontier  fragment  is  created,  it  is  unassigned.  Later,  a  fragment 
becomes  assigned  to  a  particular  worker  through  the  invocation  of  the  Spawn  function. 
When  the  worker  finishes  its  exploration,  or  exhausts  the  time  budget  assigned  for 
exploration,  it  reports  back  the  results,  and  the  fragment  assigned  to  this  worker 
becomes unassigned again. The results of worker exploration are mapped back to the
"master" copy of the execution tree using the ExploreCallback callback function. The
time budget for worker exploration is used to achieve load-balancing through time-slicing,
which is explained in Section 5.2.4. Further, Section 5.2.5 describes a mechanism
to  resolve  conflicting  concurrent  updates.  The  Partition  function  behaves  identically 
to  the  original  one,  but  it  partitions  unassigned  fragments  only. 

Algorithm 6 presents the pseudo-code of the ExploreLoop function, which is executed
by a worker. The Explore function is identical to the one used in the sequential
version of the algorithm but the frontier argument uses a copy of the frontier that contains
only the nodes needed to further the exploration of the assigned frontier fragment.
The  workers  are  started  through  the  Spawn  function  which  creates  a  private  copy  of  a 
part  of  the  execution  tree.  Structuring  the  concurrent  exploration  in  this  fashion  enables 
both  multithreaded  and  multiprocess  implementations  of  our  design. 

Since  our  goal  has  been  to  scale  the  DPOR  algorithm  to  thousands  of  workers,  each 
worker is implemented as an RPC server running as a separate process. The Spawn function
issues an asynchronous RPC request that triggers invocation of the ExploreLoop
function  with  the  appropriate  arguments  at  the  RPC  server  of  the  worker.  The  response 
to  the  RPC  request  is  then  handled  asynchronously  by  the  ExploreCallback  function, 
which  maps  the  result  of  the  worker  exploration  into  the  master  copy  of  the  execution 
tree  and  prompts  another  iteration  of  the  main  loop  of  Algorithm  6. 
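
The master/worker structure described above can be approximated in a few lines of Python. The sketch below is only an illustration under several assumptions: workers are threads from `concurrent.futures` rather than RPC servers, the budget is counted in exploration steps rather than seconds so that the example is deterministic, DPOR is omitted (all children of a node are pushed), and the names `split`, `explore_loop`, `distributed_dfs`, and `TREE` are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def split(fragments, limit):
    """Partition unassigned fragments until `limit` of them exist."""
    for fragment in list(fragments):
        while len(fragments) < limit and sum(map(len, fragment)) > 1:
            level = next(lvl for lvl in fragment if lvl)
            fragments.append([{level.pop()}])


def explore_loop(fragment, budget, children):
    """Worker: run up to `budget` depth-first steps on a private copy."""
    fragment = [set(level) for level in fragment]
    visited = []
    for _ in range(budget):
        if not fragment:
            break
        node = fragment[-1].pop()
        visited.append(node)
        kids = children.get(node, ())
        if kids:
            fragment.append(set(kids))
        while fragment and not fragment[-1]:
            fragment.pop()
    return fragment, visited


def distributed_dfs(root, children, n, budget, workers):
    """Master loop: partition, assign fragments, merge worker results."""
    unassigned, running, visited = [[{root}]], set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while unassigned or running:
            # Only unassigned fragments are partitioned; at most n exist in total.
            split(unassigned, n - len(running))
            while unassigned and len(running) < workers:
                job = pool.submit(explore_loop, unassigned.pop(), budget, children)
                running.add(job)
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for job in done:  # plays the role of ExploreCallback
                fragment, seen = job.result()
                visited.extend(seen)
                if fragment:
                    unassigned.append(fragment)  # unassigned again
    return visited


# Hypothetical toy execution tree: node -> list of children.
TREE = {0: [1, 2], 1: [3, 4], 2: [5, 6], 4: [7]}
```

Because each worker operates on its own copy of a fragment and the master merges results one at a time, no lock on the shared frontier is needed in this sketch.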

5.2.3  Fault  Tolerance 

As is commonly assumed in large distributed applications [32, 57], the failure of some node
out of thousands is common [52] and must be handled gracefully, whereas the failure of
one particular node is infrequent enough to be dealt with using re-execution. In
accordance  with  this  practice,  our  design  assumes  that  the  master,  which  is  running  the 
main  loop  of  Algorithm  6,  will  not  fail.  The  workers  on  the  other  hand  are  expected  to 
fail  and  the  exploration  can  tolerate  such  events. 

In  particular,  an  RPC  request  issued  by  the  master  to  a  worker  RPC  server  uses 
a  deadline  to  decide  whether  the  worker  has  failed.  The  value  of  the  deadline  is  set 
proportionally  larger  than  the  value  of  the  worker  time  budget.  When  the  deadline 
expires  without  an  RPC  response  arriving,  the  master  assumes  that  the  worker  has 
failed  and  unassigns  the  frontier  fragment  originally  assigned  to  the  failed  worker. 
The fragment can then be assigned to another worker for exploration.
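
A deadline-based failure detector of this kind might look as follows. This is a sketch under stated assumptions, not the ETA implementation: a thread pool stands in for the RPC layer, `slack` is a hypothetical parameter standing in for "proportionally larger", and `healthy` and `hung` are made-up workers used only to exercise the two outcomes.

```python
import concurrent.futures as cf
import time


def explore_with_deadline(pool, explore_loop, fragment, budget, slack=2.0):
    """Return the worker's result, or None if the deadline expires.

    The deadline is `slack * budget` seconds (an assumed proportionality).
    """
    job = pool.submit(explore_loop, fragment, budget)
    try:
        return job.result(timeout=slack * budget)
    except cf.TimeoutError:
        # The worker is presumed to have failed; the caller should mark
        # the fragment as unassigned so another worker can pick it up.
        return None


def healthy(fragment, budget):
    """Hypothetical worker that answers immediately."""
    return fragment + ["explored"]


def hung(fragment, budget):
    """Hypothetical worker that does not answer before the deadline."""
    time.sleep(10 * budget)
    return fragment
```

Since the master keeps the original fragment until a response arrives, a missed deadline loses at most one time slice of work.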

5.2.4  Load-balancing 

The  key  to  high  utilization  of  a  worker  fleet  is  effective  load-balancing.  To  achieve 
load-balancing,  our  design  time-slices  frontier  fragments  among  available  workers.  The 
availability  of  frontier  fragments  is  impacted  by  two  factors. 

The first factor is the upper bound n on the number of frontier fragments that
the distributed exploration creates. This parameter determines the size of the pool of
available work units. The higher this number, the higher the memory requirements of
the master but the higher the opportunity for parallelism. In our experience, setting n
to twice the number of workers works well. Studying dynamic scaling of the number of
frontier fragments is an interesting avenue for future work.

Figure 5.1: Fixed Time Budget Exploration

Figure 5.2: Variable Time Budget Exploration

The  second  factor  is  the  size  of  the  time  slice  used  for  worker  exploration.  Smaller 
time  slices  lead  to  more  frequent  generation  of  new  fragments  but  this  elasticity  comes  at 
the  cost  of  higher  communication  overhead.  Our  initial  design  used  a  fixed  time  budget, 
choosing  the  value  of  10  seconds  to  balance  elasticity  and  communication  overhead. 
However,  the  initial  evaluation  of  our  prototype  made  us  realize  that  a  variable  time 
budget  is  more  appropriate  for  large  worker  fleets. 

In particular, as the number of workers in a fixed budget scheme increases, a gap
between the realized and the ideal speedup opens up. Our intuition led us to believe
this could be caused by periods of time during which the exploration has an insufficient
number of frontier fragments to keep all workers busy. To study this problem, the
master was modified to keep track of the number of active workers over the course
of an exploration. Figure 5.1 plots this information for one of our test programs on a
configuration with 1,024 workers and an upper bound of 2,048 frontier fragments. The
figure  is  representative  of  other  measurements  at  such  a  scale. 

One  can  identify  three  phases  of  the  exploration.  In  the  first  phase,  the  number 
of active workers gradually increases over 100 seconds until there are enough frontier
fragments to keep all workers busy. In the second phase, all workers are constantly kept
busy.  In  the  third  phase,  the  number  of  active  workers  gradually  decreases  to  zero  over 
100  seconds.  Ideally,  the  first  and  the  third  phase  should  be  as  short  as  possible  in  order 
to  minimize  the  inefficiency  resulting  from  not  fully  utilizing  the  available  worker  fleet. 

To  this  aim,  one  can  switch  from  a  fixed  time  budget  to  a  variable  time  budget.  In 
particular,  if  the  exploration  is  configured  to  use  a  time  budget  b,  the  master  actually 
uses fractions of b proportional to the number of active workers. For example, the first
worker will receive a budget of $b/n$, where n is the number of workers. When half of
the workers are active, the next worker to be assigned work will receive a budget of $b/2$.
The  scaling  of  the  time  budget  is  intended  to  reduce  the  time  before  the  master  has  the 
opportunity  to  re-partition  and  load-balance  and  thus  to  reduce  the  duration  of  the  first 
and  the  third  phase. 
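
The scaling rule can be written down as a one-line function. The sketch below is an assumption about the exact formula: it takes the active-worker count to include the worker being assigned, which is the reading that matches the first-worker example in the text. The name `scaled_budget` is hypothetical.

```python
def scaled_budget(b, active_workers, n):
    """Time budget for the next assignment.

    b              -- configured (maximum) time budget
    active_workers -- active workers, counting the one being assigned (assumed)
    n              -- total number of workers
    """
    return b * active_workers / n


# First worker out of 1,024 with b = 10 s receives 10/1024 s.
first = scaled_budget(10, 1, 1024)
# With half of the workers active, the next worker receives b/2 = 5 s.
half = scaled_budget(10, 512, 1024)
```

Short budgets while few workers are active return fragments to the master quickly, which gives it more chances to re-partition during the ramp-up and ramp-down phases.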

Figure 5.2 plots the number of active workers over time for the optimized implementation
for the same test as Figure 5.1. For this test, switching to a variable time budget
reduced  the  exploration  length  from  655  seconds  down  to  527  seconds.  Similar  runtime 
improvements  have  been  achieved  for  other  tests. 

5.2.5  Avoiding  Redundant  Exploration 

For clarity of presentation, Algorithm 6 omits a provision that prevents concurrent
workers  from  exploring  overlapping  portions  of  the  execution  tree.  This  could  happen 
when  two  workers  make  concurrent  UpdateFrontier  calls  and  add  identical  nodes  to 
their  frontier  fragment  copies. 

To  avoid  this  problem,  our  implementation  introduces  the  concept  of  node  ownership. 
A  worker  exclusively  owns  a  node  if  it  is  contained  in  the  original  frontier  fragment 
assigned  to  the  worker,  or  if  the  node  is  a  descendant  of  a  node  that  the  worker  owns. 
All  other  nodes  are  assumed  to  be  shared  with  other  workers  and  the  node  ownership 
restricts  which  nodes  a  worker  may  explore. 

In  particular,  the  depth-first  search  exploration  of  a  worker  is  allowed  to  operate  only 
over nodes that the worker owns. When it encounters a shared node during its exploration,
the worker terminates its exploration and sends an RPC response to the master
indicating  which  nodes  of  the  frontier  fragment  are  shared.  The  ExploreCallback 
function  checks  the  status  of  the  newly  discovered  shared  nodes.  If  a  newly  discovered 
shared  node  is  not  part  of  some  other  frontier  fragment,  the  node  is  added  to  the  master 
copy  of  the  currently  processed  frontier  fragment  (ownership  is  claimed).  Otherwise, 
the ownership of the node has already been claimed and the node is not added to the
master  copy  of  the  currently  processed  frontier  fragment. 
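
The ownership check performed on the master can be sketched as follows. The data structure is an assumption made for illustration: a dictionary `owner` that maps each claimed node to the identifier of the fragment that owns it. The function name `claim_shared_nodes` and the string identifiers are hypothetical.

```python
def claim_shared_nodes(owner, fragment_id, shared_nodes):
    """Claim the shared nodes that no other fragment owns yet.

    Returns the nodes that should be added to the master copy of the
    fragment `fragment_id`; nodes already claimed elsewhere are skipped.
    """
    claimed = []
    for node in shared_nodes:
        if node not in owner:
            owner[node] = fragment_id  # ownership is claimed
            claimed.append(node)
    return claimed


owner = {}
first_claim = claim_shared_nodes(owner, "A", ["x", "y"])
second_claim = claim_shared_nodes(owner, "B", ["y", "z"])
```

Because only the master updates `owner`, and it processes worker responses one at a time, two fragments can never both claim the same node in this sketch.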

Although this provision could in theory lead to increased communication overhead
and  decreased  worker  fleet  utilization,  our  experiments  indicate  that  in  practice  the 
provision  does  not  affect  performance. 


5.3  Evaluation 

To evaluate the design presented in Section 5.2, a prototype of the design was implemented
as part of ETA [139], a tool developed for systematic testing of multithreaded
components of the Omega cluster management system [129]. These components are
written using a library based on the actors paradigm [2]. To exercise different concurrency
scenarios, ETA systematically enumerates different orders in which messages
between  actors  can  be  delivered. 


5.3.1  Experimental  Setup 

The evaluation used instances of tests from the Omega test suite that exercise fundamental
functionality of core components of the cluster management system. The
Resource(x) test is representative of a class of actor program tests that evaluate interactions
of x different users that acquire and release resources from a pool of x resources.
The  Scheduling(x)  test  is  representative  of  a  class  of  actor  program  tests  that  evaluate 
interactions  of  x  users  issuing  concurrent  scheduling  requests.  The  Store(x,y,z)  test  is 
representative  of  a  class  of  actor  program  tests  that  evaluate  interactions  of  x  users  of  a 
distributed  key-value  store  with  y  front-end  nodes  and  z  back-end  nodes. 

Unless  stated  otherwise,  each  measurement  presented  in  the  remainder  of  this 
section corresponds to a complete exploration of the given test and the results report the
mean  and  the  standard  deviation  of  three  repetitions  of  the  exploration.  Lastly,  all 
experiments  were  carried  out  inside  of  a  Google  data  center  [65]  using  stock  hardware 
and  running  each  process  on  a  separate  virtual  machine. 


5.3.2  Faults 

First,  the  evaluation  focused  on  the  ability  of  the  implementation  to  handle  worker 
failures.  To  that  end,  the  implementation  was  extended  with  an  option  to  inject  an  RPC 
fault  with  a  certain  probability.  When  an  RPC  fault  is  injected,  the  master  fails  to  receive 
the  RPC  response  from  a  worker  and  waits  for  the  RPC  deadline  to  expire  instead. 

Our  experiments  demonstrated  that  the  runtime  increases  proportionally  to  the 
geometric  progression  of  repeated  RPC  failures.  For  example,  when  RPCs  have  a  50% 
chance  of  failing,  the  runtime  doubles.  Since  in  actual  deployments  of  ETA,  RPCs  fail 
with  probability  well  under  1%  [52],  our  support  for  fault  tolerance  is  practical. 
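
The doubling at a 50% failure rate is what a simple geometric-series model predicts. The sketch below rests on two assumptions that the text does not state: RPC failures are independent, and a failed attempt costs roughly as much time as a successful one (in the actual design the deadline is larger than the time budget, so a failed attempt may cost more). The name `expected_attempts` is hypothetical.

```python
def expected_attempts(p):
    """Mean number of RPC attempts until one succeeds.

    With failure probability p, the expected count is the geometric
    series 1 + p + p**2 + ... = 1 / (1 - p).
    """
    if not 0 <= p < 1:
        raise ValueError("failure probability must be in [0, 1)")
    return 1.0 / (1.0 - p)


slowdown_at_half = expected_attempts(0.5)       # runtime factor at p = 0.5
slowdown_at_one_pct = expected_attempts(0.01)   # runtime factor at p = 0.01
```

Under this model a 1% failure rate inflates the runtime by only about 1%, which is consistent with the claim that the fault tolerance support is practical.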

Figure 5.3: Scalability Results for Resource(6)

Figure 5.4: Scalability Results for Scheduling(10)

Figure 5.5: Scalability Results for Store(12,3,3)


5.3.3  Scalability 

To measure the scalability of the implementation, the time needed to complete an exploration
by a sequential implementation of the DPOR algorithm was compared against
the time needed to complete the same exploration by our distributed implementation.
Configurations with 32, 64, 128, 256, 512, and 1,024 workers were considered and the
algorithm was used to explore the Resource(6), Store(12,3,3), and Scheduling(10)
instances of actor program tests; their parameters were chosen to yield interesting
state  space  sizes.  The  time  budget  b  of  each  worker  exploration  was  set  to  10  seconds 
and  the  target  number  of  frontier  fragments  was  set  to  twice  the  number  of  workers. 

The results of Resource(6), Store(12,3,3), and Scheduling(10) experiments are
presented  in  Figures  5.3,  5.4,  and  5.5.  Due  to  the  magnitude  of  the  state  spaces  being 
explored  -  18.5  million,  21  million,  and  3.6  million  branches  respectively  -  the  runtime 
of  the  sequential  algorithm  was  extrapolated  using  a  partial  run  to  209,  215,  and  126 
hours  respectively.  The  graphs  visualize  the  speedup  over  the  extrapolated  runtime  of 
the  sequential  algorithm  and  compare  it  to  the  ideal  speedup.  These  results  evidence 
strong  scaling  of  our  implementation  of  the  DPOR  algorithm  at  a  large  scale.  The  largest 
configuration  uses  1,024  workers  and  our  implementation  achieves  speedup  that  ranges 
between 760× and 920×.
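
To put these speedups in perspective, they can be converted to parallel efficiency, that is, speedup divided by the number of workers. The short calculation below uses only the figures reported above; the function name `parallel_efficiency` is an assumption.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of the ideal (linear) speedup that was realized."""
    return speedup / workers


low = parallel_efficiency(760, 1024)   # lower end of the reported range
high = parallel_efficiency(920, 1024)  # upper end of the reported range
```

At 1,024 workers the reported range corresponds to roughly 74% to 90% of the ideal linear speedup.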

5.3.4  Theoretical  Limits 

Finally,  our  evaluation  focused  on  projecting  the  theoretical  scalability  limits  of  our 
implementation.  To  this  end,  the  memory  and  CPU  requirements  of  the  master  -  the 
obvious  bottleneck  in  our  centralized  design  -  were  measured  and  projected. 

Memory  Requirements:  The  memory  overhead  is  dominated  by  the  cost  to  store 
the  master  copy  of  the  exploration  frontier.  To  estimate  the  overhead,  the  amount  of 
memory  allocated  for  the  nodes  of  the  execution  tree  and  the  exploration  frontier  data 
structures was measured over the course of an exploration. For the Scheduling(10)
test on a configuration with 1,024 workers and an upper bound of 2,048 frontier
fragments, the peak amount of the allocated memory was less than 4 MB. This number
is  representative  of  results  for  other  tests  at  such  a  scale.  Thus,  for  the  current  computer 
architectures,  the  typical  memory  systems  support  scaling  to  millions  of  workers. 

CPU Requirements: With 1,024 workers and a 10-second time budget, the master is
expected  to  issue  around  100  RPC  requests  and  to  process  around  100  RPC  responses 
every  second.  For  such  a  load,  the  stock  hardware  running  exclusively  the  master 
process  experienced  peak  CPU  utilization  under  20%.  Consequently,  for  the  current 
computer architectures, the CPU requirements scale to around 5,000 workers. To scale
our  implementation  beyond  that,  one  can  proportionally  increase  the  time  budget, 
upgrade  to  better  hardware,  or  optimize  the  software  stack.  For  instance,  one  could 
replace  the  master  with  a  hierarchy  of  masters. 
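
The arithmetic behind this projection can be reproduced with a back-of-the-envelope calculation. It assumes that each worker issues one request per time budget and that the CPU load of the master grows linearly with the number of workers; the 20% figure is used as an upper bound, since the text reports utilization "under 20%". The function names are hypothetical.

```python
def rpc_rate(workers, budget_seconds):
    """RPC requests per second if every worker reports once per budget."""
    return workers / budget_seconds


def projected_worker_limit(workers, peak_cpu_utilization):
    """Workers at which the master CPU would saturate (linear scaling assumed)."""
    return workers / peak_cpu_utilization


rate = rpc_rate(1024, 10)                     # about 100 requests per second
limit = projected_worker_limit(1024, 0.20)    # about 5,000 workers
```

The same model shows why increasing the time budget helps: doubling the budget halves the request rate and therefore roughly doubles the projected worker limit.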

5.4  Related  Work 

Algorithms 

Dwyer  et  al.  [45]  presented  a  parallel  algorithm  that  explores  the  state  space  using  a 
number  of  independent  randomized  depth-first  searches  to  decrease  the  time  needed  to 
locate an error. In comparison, our parallelization of systematic testing aims to cover
the  full  state  space  faster. 

The  work  of  Staats  and  Pasareanu  [143]  targets  parallelization  of  symbolic  execution 
using  randomness  and  static  partitioning.  The  parallelization  presented  in  this  chapter 
is  systematic  and  uses  dynamic  partitioning. 

Tools 

Inspect  [166]  is  a  tool  for  systematic  testing  of  pthreads  C  programs  that  implements 
the  distributed  DPOR  [167]  discussed  in  Section  5.1.  Unlike  our  work,  the  Inspect  tool 
does  not  support  fault  tolerance,  is  not  guaranteed  to  avoid  redundant  exploration,  and 
has  not  been  demonstrated  to  scale  beyond  64  workers. 

DeMeter [67] provides a framework for extending existing sequential model checkers
[82, 160] with a parallel and distributed exploration engine. Similar to our work, the
framework  focuses  on  efficient  state  space  exploration  of  concurrent  programs.  Unlike 
our  work,  the  design  has  not  been  thoroughly  described  or  analyzed  and  has  been  only 
demonstrated  to  scale  up  to  32  workers. 

Cloud9 [27] is a parallel engine for symbolic execution of sequential programs.
In comparison to our work, the state space being explored is the space of possible
program inputs, not schedules. Systematic enumeration of different program inputs is
an orthogonal problem to the one addressed by this chapter.

Parallelization  of  software  verification  was  also  investigated  in  the  context  of  explicit 
state  space  model  checkers  such  as  MurPhi  [144],  DiVinE  [12],  or  SWARM  [73].  Stateful 
exploration  is  less  common  in  implementation-level  model  checkers  as  storing  a  program 
state  explicitly  becomes  prohibitively  expensive. 


5.5  Conclusions 

This chapter presented a technique that improves the state of the art of scalable
techniques for systematic testing of concurrent programs. Our design for distributed DPOR
enables the exploitation of a large-scale cluster for the purpose of systematic testing.
At the core of the design lies a novel exploration algorithm, n-partitioned depth-first
search, which has proven to be essential for scaling our design to thousands of workers.

Unlike  previous  work  [167],  our  design  provides  support  for  fault  tolerance,  a 
mandatory  aspect  of  scalability,  and  is  guaranteed  to  avoid  redundant  exploration  of 
identical  parts  of  the  state  space  by  different  workers.  Further,  our  implementation  and 
deployment  in  a  real-world  system  at  scale  has  demonstrated  that  the  design  achieves 
almost  linear  speed  up  for  up  to  1,024  workers.  Lastly,  theoretical  analysis  of  our 
design  discussed  its  scalability  limits  and  proposed  solutions  for  scaling  the  design  for 
next-generation computer clusters.

Chapter 6

Restricted  Runtime  Scheduling 


Our  accelerating  computational  demand  and  the  rise  of  multicore  hardware  have  made 
multithreaded  programs  increasingly  pervasive  and  critical.  Yet,  these  programs  remain 
extremely  difficult  to  write,  test,  analyze,  debug,  and  verify.  A  key  reason  is  that,  for 
decades,  the  contract  between  developers  and  thread  runtimes  has  favored  performance 
over  correctness.  In  this  contract,  developers  use  synchronizations  to  coordinate  threads, 
while  thread  runtimes  can  use  any  of  the  exponentially  many  thread  interleavings,  or 
schedules, compliant with the synchronizations. This large number of possible schedules
makes it more likely that the runtime finds an efficient schedule for a workload. However,
ensuring that all schedules are correct is extremely challenging, and a single missed
schedule may surface at the least expected moment, triggering concurrency errors
responsible for possibly critical failures [91, 97, 122].

To simplify testing, debugging, record-replay, and program behavior replication, a
number of recent systems [7, 15, 17, 18, 39, 40, 41, 42, 109] aim to flip this
performance-correctness trade-off through restricted runtime scheduling, an approach that
dramatically limits the number of schedules an execution of a multithreaded program may use.

Restricted  runtime  scheduling  is  orthogonal  to  systematic  testing  but  there  is  synergy 
between  the  two  approaches.  Restricted  runtime  scheduling  reduces  the  number  of 
schedules  systematic  testing  needs  to  check,  while  systematic  testing  ascertains  that  the 
schedules  allowed  by  restricted  runtime  scheduling  have  all  been  tested. 

The  goal  of  this  chapter  is  to  demonstrate  and  quantify  the  benefits  of  combining 
restricted  runtime  scheduling  with  systematic  testing.  To  this  end,  this  chapter  presents 
an  ecosystem  formed  by  combining  dBug  with  Parrot  [39],  an  implementation  of 
restricted  runtime  scheduling  for  POSIX-compliant  operating  systems.  The  state  space 
estimation  of  dBug  is  then  used  to  contrast  the  number  of  schedules  dBug  needs  to 
check  under  nondeterministic  scheduling  to  the  number  of  schedules  dBug  needs  to 
check  under  Parrot's  restricted  runtime  scheduling. 

The  rest  of  this  chapter  is  organized  as  follows.  Section  6.1  provides  an  overview 
of  restricted  runtime  scheduling.  Section  6.2  describes  the  integration  of  dBug  and 
Parrot.  Section  6.3  presents  the  evaluation  results.  Section  6.4  discusses  related  work, 
and Section 6.5 draws conclusions.

6.1 Background

This section first presents an evolution of restricted runtime scheduling techniques
(§6.1.1) and then describes Parrot [39], an implementation of restricted runtime
scheduling for POSIX-compliant operating systems (§6.1.2).

6.1.1  Techniques 

To drive performance of multithreaded programs, traditional thread runtimes are
allowed to choose any of the exponentially many schedules compliant with the thread
synchronizations that multithreaded programs use. This approach, referred to as
nondeterministic multithreading, allows a many-to-many mapping between program inputs and
schedules as depicted in Figure 6.1. Because the number of schedules the runtime may
choose from is astronomical, exhaustively testing a multithreaded program under
nondeterministic scheduling is infeasible.

In  contrast  to  nondeterministic  multithreading,  deterministic  multithreading  (DMT)  [7, 
15,  17,  18,  40,  41,  42,  109]  reduces  the  number  of  allowed  schedules  by  deterministically 
mapping  each  input  to  a  schedule  as  depicted  in  Figure  6.2.  Thus,  with  DMT,  executions 
of  the  same  program  on  the  same  input  and  hardware  always  exhibit  the  same  behavior. 

Although DMT enables replication of program behavior, it does not reduce the
testing burden as much as one might expect. The reason for this shortcoming of DMT
is that different inputs can be mapped to different schedules, disguising scheduling
nondeterminism as input nondeterminism. To address this problem, DMT can be
complemented with stable multithreading (SMT) [7, 17, 18, 40, 41], which reduces the set of
schedules for all inputs by mapping similar inputs to the same schedule as depicted in
Figure 6.3. This approach considerably reduces the number of allowed schedules and
the effort needed to test a program.

Figure 6.4: Nondeterministic Stable Multithreading

While  combining  DMT  with  SMT  makes  testing  of  multithreaded  programs  easier, 
mapping many inputs to the same schedule can lead to artificial serialization of concurrent
computation, resulting in performance loss [17, 40, 41]. To address this problem,
recent  work  [39,  161,  162]  advocates  the  use  of  SMT  that  may  be  nondeterministic  but 
keeps  the  amount  of  nondeterminism  small,  resulting  in  a  many-to-few  mapping  as 
depicted  in  Figure  6.4.  Nondeterministic  SMT  balances  performance  and  testability  by 
allowing  the  runtime  to  choose  an  efficient  schedule  from  a  set  of  schedules  based  on 
the  current  timing,  while  keeping  the  set  of  allowed  schedules  small  in  order  to  reduce 
the  effort  needed  to  check  all  schedules. 
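
The difference between the DMT and SMT mappings can be illustrated with a toy model. In the C sketch below, inputs and schedules are represented by plain integers, and both mapping functions are hypothetical: they are not taken from any DMT or SMT runtime and serve only to show how grouping similar inputs shrinks the set of schedules that need to be tested.

```c
#include <assert.h>
#include <stddef.h>

/* Toy DMT mapping: every input deterministically gets a schedule, but
 * different inputs may get different schedules. */
unsigned dmt_schedule(unsigned input) { return input; }

/* Toy SMT mapping: similar inputs (here, inputs in the same bucket of 100)
 * share one schedule. */
unsigned smt_schedule(unsigned input) { return input / 100; }

/* Counts the distinct schedules used for inputs 0..n-1. The count is only
 * correct for mappings that never decrease as the input grows, which holds
 * for both toy mappings above. */
size_t count_schedules(unsigned (*map)(unsigned), unsigned n) {
    assert(map != NULL && n > 0);
    size_t distinct = 1;
    for (unsigned i = 1; i < n; i++) {
        if (map(i) != map(i - 1)) {
            distinct++;
        }
    }
    return distinct;
}
```

For 1,000 inputs, the toy DMT mapping uses 1,000 schedules while the toy SMT mapping uses 10. Under nondeterministic SMT, each group of inputs would be allowed a small set of schedules rather than exactly one, so the total stays close to the SMT count.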

6.1.2  Parrot 

Parrot [39] is an implementation of nondeterministic SMT for POSIX-compliant
operating systems. Similar to dBug, it uses runtime interposition to intercept and order
invocations of the POSIX interface. Unlike dBug, which serializes the program
transitions delimited by POSIX interface invocations, Parrot only serializes POSIX interface
invocations themselves, overlapping computation that happens in between. Further,
Parrot only orders pthreads synchronizations [120].

Figure 6.5: Performance Hints API

To this end, Parrot maintains a run queue that identifies program threads that can
make progress and a wait queue that identifies program threads that are blocked. By
default, Parrot uses deterministic SMT, scheduling pthreads synchronizations invoked
by threads on the run queue in a round-robin fashion. In addition, Parrot exports
a simple API, referred to as performance hints, through which programs can override the
default scheduling policy of Parrot, possibly resulting in nondeterministic SMT. Parrot
supports two types of hints, soft barriers and performance-critical sections; their API
is depicted in Figure 6.5.

A  soft  barrier  can  be  used  to  express  co-scheduling  intent  [112].  In  that  sense,  it  acts 
as  a  traditional  barrier.  The  difference  between  a  soft  barrier  and  a  traditional  barrier 
is  that  a  soft  barrier  can  time  out.  The  reason  for  allowing  a  soft  barrier  to  time  out  is 
to  address  situations  where  the  programmer  cannot  accurately  predict  the  number  of 
threads  that  would  join  the  soft  barrier.  The  timeout  argument  specifies  the  maximum 
number  of  Parrot's  scheduling  decisions  a  soft  barrier  waits  before  deterministically 
releasing  all  waiting  threads.  The  ability  to  time  out  makes  a  soft  barrier  more  robust 
to  developer  mistakes  than  a  traditional  barrier  because  schedulers  cannot  ignore  a 
traditional  barrier  [44]. 

A performance-critical section can be used to identify a code region that the
scheduler should execute as quickly as possible. To this end, Parrot schedules pthreads
synchronizations inside of a performance-critical section as soon as possible, overriding
the default round-robin scheduling policy. Thus, unlike a soft barrier, a
performance-critical section can result in nondeterministic scheduling. However, in contrast to
traditional  nondeterministic  multithreading  that  uses  nondeterministic  scheduling  by 
default,  performance-critical  sections  allow  developers  to  explicitly  identify  code  regions 
that  benefit  from  nondeterministic  scheduling  for  improved  performance. 

Figure 6.6 illustrates the mechanism Parrot uses for controlling scheduling of thread
synchronizations. Similar to dBug's interposition layer, Parrot intercepts invocations of
the pthreads synchronizations and schedules them in a round-robin order. This order
can be overridden by performance hints. Soft barriers steer Parrot's scheduler towards
a more efficient schedule without introducing nondeterminism, while performance-critical
sections temporarily exclude some threads from Parrot's scheduling, introducing
nondeterminism by allowing multiple schedules.

Figure 6.6: Controlling Thread Scheduling with Parrot Interposition Layer

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    pthread_mutex_t mutex;
    int x = 0;

    /* Child thread: increments x while holding the mutex. */
    void *foo(void *args) {
      pthread_mutex_lock(&mutex);
      x++;
      pthread_mutex_unlock(&mutex);
      return NULL;
    }

    int main(int argc, char **argv) {
      pthread_t tid;
      pthread_mutex_init(&mutex, NULL);
      pthread_create(&tid, NULL, foo, NULL);
      /* Parent thread: increments x while holding the mutex. */
      pthread_mutex_lock(&mutex);
      x++;
      pthread_mutex_unlock(&mutex);
      pthread_join(tid, NULL);
      pthread_mutex_destroy(&mutex);
      assert(x == 2);
      return 0;
    }

Figure 6.7: Concurrent pthreads Synchronizations - Source Code

Example 9. To illustrate Parrot's operation, let us consider the example program
depicted in Figure 6.7. In this example, a thread spawns a child thread and both of
the threads then use a mutex to protect their concurrent updates to a global variable.
When executed with Parrot, Parrot intercepts invocations of POSIX interface functions,
internally maintaining a run queue and a wait queue and controlling the order in
which pthreads synchronizations happen. Figure 6.8 depicts the sequence of abstract
program states explored by Parrot. Each state identifies the value of the global variable
x and the control flow position of both threads. After the child thread is created, the run
queue contains both the parent thread and the child thread. Traditionally, the OS
scheduler could schedule the pthreads synchronization of either of the two threads next.
In contrast, Parrot suspends execution of both threads and uses a deterministic
round-robin policy to determine that the parent thread should be scheduled first. Note
that Parrot understands the semantics of pthreads synchronizations and updates the
run queue and wait queue accordingly. For example, when the parent thread acquires
the mutex, the child thread is moved from the run queue to the wait queue.

Figure 6.8: Concurrent pthreads Synchronizations - Parrot Execution
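
The round-robin policy described in Example 9 can be approximated by a small simulation. The C sketch below is a simplified, hypothetical model and not Parrot's code: it tracks only the lock and unlock operations of the two threads (thread creation and join are omitted), uses a single mutex, and moves a thread to the wait queue when it attempts to lock a mutex that is already held.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 2 /* thread 0 models the parent, thread 1 the child */

enum op { LOCK, UNLOCK, DONE };

struct event { int thread; enum op op; };

/* Each thread performs: lock, (x++), unlock. */
static const enum op program[] = { LOCK, UNLOCK, DONE };

/* Simulates round-robin scheduling of synchronizations. Completed
 * synchronizations are appended to trace; the return value is their count. */
size_t simulate(struct event *trace, size_t cap) {
    int pc[NTHREADS] = { 0 };     /* next synchronization of each thread */
    int runq[NTHREADS], nrun = 0; /* FIFO run queue of thread ids */
    int waitq[NTHREADS], nwait = 0;
    int owner = -1;               /* thread holding the mutex, or -1 */
    size_t n = 0;

    for (int t = 0; t < NTHREADS; t++) runq[nrun++] = t;

    while (nrun > 0) {
        int t = runq[0]; /* take the thread at the head of the run queue */
        for (int i = 1; i < nrun; i++) runq[i - 1] = runq[i];
        nrun--;

        enum op o = program[pc[t]];
        if (o == DONE) continue; /* thread finished; it leaves the run queue */
        if (o == LOCK && owner != -1) {
            waitq[nwait++] = t;  /* mutex is held: block the thread */
            continue;
        }
        if (o == LOCK) {
            owner = t;
        } else {                 /* UNLOCK: wake all blocked threads */
            owner = -1;
            for (int i = 0; i < nwait; i++) runq[nrun++] = waitq[i];
            nwait = 0;
        }
        assert(n < cap);
        trace[n].thread = t;
        trace[n].op = o;
        n++;
        pc[t]++;
        runq[nrun++] = t;        /* round-robin: rejoin at the tail */
    }
    assert(nwait == 0);          /* no thread may remain blocked */
    return n;
}
```

Every run of this model produces the same order of synchronizations: the parent locks and unlocks the mutex, and then the child locks and unlocks it. The model has no source of nondeterminism, mirroring the single schedule that the default deterministic policy allows for this program.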

Note that Parrot assumes that no thread whose synchronizations are ordered by
Parrot's scheduler can block outside of Parrot. Obviously, this assumption does not hold
for real-world programs that use inter-process synchronizations that may block, such as
poll(), select(), or epoll_wait(). To account for inter-process synchronizations,
Parrot treats their invocations as if they were included in a performance-critical section,
effectively excluding them from Parrot's scheduling algorithm.

Figure 6.9: Layering of Parrot and dBug Interposition


6.2  Methods 

The idea behind nondeterministic SMT is to improve testability of multithreaded
programs without hurting performance. To demonstrate that this idea translates to practice,
this section presents an integration of Parrot, an implementation of nondeterministic
SMT, and dBug, a systematic testing tool. The restricted runtime scheduling of
Parrot reduces the number of schedules multithreaded programs can use. However,
performance-critical hints or invocations of inter-process synchronizations can still act as
a source of scheduling nondeterminism. Consequently, dBug is used to systematically
enumerate the reduced set of schedules allowed by Parrot.


6.2.1  Interposition  Layering 

Parrot and dBug are both implemented using a similar concept: use runtime
interposition to intercept and order invocations of the POSIX interface. The key insight to their
integration is that for the purpose of systematic testing, Parrot can be viewed as part
of the multithreaded program. This insight is captured by the layered design of the
integration depicted in Figure 6.9, which combines the interposition of Parrot and dBug
depicted in Figures 6.6 and 3.7, respectively.

A  naive  layering  of  Parrot  and  dBug  may,  however,  introduce  artificial  deadlocks.  To 
understand  why  this  might  happen,  let  us  consider  the  execution  depicted  in  Figure  6.8 
but  this  time  with  dBug  in  the  picture.  When  Parrot  reaches  the  state  in  which  both 
the  parent  thread  and  the  child  thread  are  on  the  run  queue,  it  uses  its  deterministic 
round-robin  policy  to  schedule  the  synchronization  of  the  parent  thread  first,  moving 
the  child  thread  to  the  wait  queue.  This  synchronization  is  then  intercepted  by  dBug 
but,  following  its  scheduling  policy,  dBug  will  not  make  a  scheduling  decision  until 
it  intercepts  an  event  of  the  child  thread.  At  this  point,  the  parent  thread  is  blocked 
in  dBug  waiting  for  Parrot  to  make  progress  and  the  child  thread  is  blocked  in  Parrot 
waiting for dBug to make progress. The crux of this problem is that dBug does not
know what threads are blocked in Parrot.

    void thread_waiting();
    void thread_running(pthread_t tid);

Figure 6.10: Thread Status API

6.2.2  Scheduling  Coordination 

To overcome this problem, dBug was extended with an interface that Parrot (or any
other program for that matter) can use to inform dBug that 1) the execution of a
running thread has been suspended, or 2) the execution of a suspended thread has
been resumed. This interface is depicted in Figure 6.10. The thread_waiting()
function can be used to inform dBug that the calling thread is going to be suspended,
while the thread_running() function can be used to inform dBug that the thread
identified by the tid argument is going to be resumed. Further, the scheduling
algorithm of the dBug arbiter was modified to treat threads that have invoked the
thread_waiting() function as quiesced until they are referenced by an invocation
of the thread_running() function. The implementation of the new interface and the
scheduling algorithm modification required less than 50 lines of code. Finally, Parrot
was annotated with calls to the new dBug interface to match the movement of threads
between Parrot's run queue and wait queue, requiring less than 20 lines of code. With
the new interface and annotations in place, the layered interposition of Parrot and
dBug works seamlessly.
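
The effect of the thread status interface on the arbiter can be sketched as follows. The C code below is a hypothetical model and not dBug's implementation: threads are identified by small integers instead of pthread_t values, and the decision rule (the arbiter proceeds once no thread is still running towards its next event) is an assumption that paraphrases the scheduling policy described in Section 6.2.1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical per-thread status tracked by the arbiter. */
enum status { RUNNING, AT_EVENT, QUIESCED };

struct arbiter {
    enum status st[MAX_THREADS];
    int nthreads;
};

void arbiter_init(struct arbiter *a, int nthreads) {
    assert(a != NULL && nthreads > 0 && nthreads <= MAX_THREADS);
    a->nthreads = nthreads;
    for (int i = 0; i < nthreads; i++) a->st[i] = RUNNING;
}

/* Thread t reached an intercepted event and waits for the arbiter. */
void on_event(struct arbiter *a, int t) { a->st[t] = AT_EVENT; }

/* Models thread_waiting(): thread t reports that it is being suspended. */
void on_thread_waiting(struct arbiter *a, int t) { a->st[t] = QUIESCED; }

/* Models thread_running(tid): thread t is reported as resumed. */
void on_thread_running(struct arbiter *a, int t) { a->st[t] = RUNNING; }

/* The arbiter can make a scheduling decision when no thread is still
 * running towards its next event and at least one thread waits at an event.
 * Quiesced threads are ignored, which avoids the artificial deadlock. */
bool can_decide(const struct arbiter *a) {
    bool any_event = false;
    for (int i = 0; i < a->nthreads; i++) {
        if (a->st[i] == RUNNING) return false;
        if (a->st[i] == AT_EVENT) any_event = true;
    }
    return any_event;
}
```

In the deadlock scenario of Section 6.2.1, the parent (thread 0) is at an event while the child (thread 1) is blocked in Parrot. Without a status report the model considers the child running and can_decide() stays false; once on_thread_waiting() is invoked for the child, can_decide() becomes true and the parent can be scheduled.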


6.3  Evaluation 

To  quantify  the  benefits  of  integrating  restricted  runtime  scheduling  with  systematic 
testing,  our  integration  of  Parrot  and  dBug  was  evaluated  on  a  diverse  set  of  108 
workloads  [39].  This  set  includes  55  real-world  workloads  based  on  the  following 
programs: 

•  bdb,  a  widely  used  database  library  [108] 

•  openldap,  a  server  implementing  the  Lightweight  Directory  Access  Protocol  [110] 

•  redis,  a  fast  key-value  data  store  server  [124] 

•  mplayer,  a  popular  media  encoder,  decoder,  and  player  [104] 

•  pbzip2,  a  parallel  compression  utility  [59] 

• pfscan, a parallel grep-like utility [113]

•  aget,  a  parallel  file  download  utility  [1] 

•  33  parallel  C++  STL  algorithm  implementations  [53,  142] 

• 14 parallel image processing utilities in the ImageMagick software suite [76]

Further, this set also includes 53 workloads from four widely used benchmark suites:

•  15  workloads  from  PARSEC  [21] 

•  14  workloads  from  Phoenix  [168] 

•  14  workloads  from  SPLASH-2x  [20,  141] 

•  10  workloads  from  NPB  [8] 

To avoid biasing our results, no workloads from these benchmark suites were excluded
from our consideration. The benchmark suites cover a number of different
programming languages, including C, C++, and Fortran, and a plethora of different
parallel programming models and idioms such as pthreads [120], OpenMP [111], data
partition, fork-join, pipeline, map-reduce, and work-pile.

6.3.1  Experimental  Setup 

The  evaluation  used  a  2.80  GHz  dual-socket  hex-core  Intel  Xeon  with  24  cores  and  64 
GB  memory  running  Ubuntu  12.04. 

The bdb workload is based on a popular benchmark bench3n [4], which does
fine-grained, highly concurrent transactions. The openldap and redis workloads are both
based on benchmarks included in their distribution. The mplayer workload is based
on its utility mencoder, transcoding a 255 MB video from MP4 to AVI. For pbzip2,
one workload compresses a 145 MB binary file and another workload decompresses
the compressed version of this file. The pfscan workload searches for the keyword
return in 16K files contained in /usr/include on our evaluation machine. The
aget  workload  downloads  a  656  MB  file  from  a  local  web  server.  All  ImageMagick 
workloads  use  a  33  MB  JPG  file  as  an  input.  All  33  parallel  STL  algorithms  use  integer 
vectors  with  256M  elements.  Workloads  from  the  PARSEC,  Phoenix,  SPLASH-2x,  and 
NPB  benchmark  suites  used  the  smallest  configuration  of  these  workloads.  In  particular, 
all  workloads  in  our  evaluation  used  two  to  four  threads. 

All  multiprocess  client-server  workloads  (openldap,  redis,  and  aget)  use  a 
simple  binary  that  drives  the  workload.  This  binary  first  starts  the  server  process,  next 
it  starts  the  client  process,  next  it  waits  for  the  client  process  to  terminate,  and  finally  it 
kills  the  server  process,  which  concludes  the  workload. 
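
The four steps performed by the driver binary can be sketched in C using standard POSIX process calls. The function names below are hypothetical and this is not the driver used in the evaluation; in particular, the sketch does not wait for the server to become ready before starting the client, which a real driver would likely need to do.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child process that executes argv. Returns the child's pid in the
 * parent, or -1 if fork() fails. */
static pid_t spawn(char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* reached only if execvp() fails */
    }
    return pid;
}

/* Starts the server, starts the client, waits for the client to terminate,
 * and kills the server. Returns the client's exit status, or -1 on error. */
int run_workload(char *const server_argv[], char *const client_argv[]) {
    assert(server_argv != NULL && client_argv != NULL);

    pid_t server = spawn(server_argv);
    if (server < 0) return -1;

    int result = -1;
    int status = 0;
    pid_t client = spawn(client_argv);
    if (client > 0 && waitpid(client, &status, 0) == client &&
        WIFEXITED(status)) {
        result = WEXITSTATUS(status);
    }

    kill(server, SIGKILL);    /* the workload ends by killing the server */
    waitpid(server, NULL, 0); /* reap the server to avoid a zombie process */
    return result;
}
```

When such a driver runs under a systematic testing tool, the server and the client are separate processes of one test, so their inter-process communication becomes part of the explored state space.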

Note  that  using  relatively  small  inputs  and  a  small  number  of  threads  produces  a 
lower  bound  on  the  number  of  schedules  of  workloads  with  more  threads  and  larger 
inputs.  Our  measurements  (§  6.3.2)  show  that  even  for  these  relatively  small  workloads, 
the  extent  of  state  space  explosion  is  considerable  and  the  benefits  of  combining 
restricted runtime scheduling with systematic testing far exceed the benefits of any
other practical state space reduction technique [51, 67].

All programs were compiled using gcc -O2. To support OpenMP programs such
as parallel STL algorithms and NPB, the GNU libgomp implementation of OpenMP
was used. Five programs use ad-hoc synchronization [159], and sched_yield was
added to their busy-wait loops to make these programs work with Parrot and dBug.

Finally, the evaluation used the performance hints added by the authors of Parrot [39].
Of all 108 programs, 18 have reasonable overhead with the default Parrot schedule,
requiring no hints. 81 programs need a total of 87 lines of soft barrier hints: 43 need
only 4 lines of generic soft barrier hints in libgomp, and 38 need program-specific soft
barrier hints. These programs enjoy both determinism and reasonable performance.
Only 9 programs need a total of 22 lines of performance-critical hints, introducing
isolated scheduling nondeterminism to achieve good performance.


6.3.2  Results 

The  results  of  our  evaluation  are  divided  into  two  groups.  The  first  group  contains 
96  workloads  for  which  Parrot  uses  only  one  schedule.  These  are  all  multithreaded 
workloads  that  either  use  no  hints  or  soft  barrier  hints  only.  The  second  group  contains 
12 workloads for which Parrot uses multiple schedules. These are all multiprocess
workloads (aget, openldap, and redis) and all multithreaded workloads that use
performance-critical hints.


Single  Schedule  Workloads 

The  first  group  of  workloads  does  not  require  dBug  to  be  used  for  testing  as  the  only 
schedule  allowed  by  Parrot  can  be  exercised  by  simply  running  the  workload  in  Parrot. 
Nevertheless,  to  estimate  the  state  space  reduction  realized  by  Parrot,  all  96  workloads 
were tested with dBug without using Parrot. In particular, dBug repeatedly executed
each workload until it enumerated all schedules or a 24-hour timeout was reached,
in which case dBug's state space estimation based on the lazy strategy, the
weighted-backtrack estimator, and the empty fit (cf. Chapter 4) was used to estimate the time
needed to explore all possible schedules.

Figure  6.11  depicts  the  results  of  this  experiment.  The  vertical  axis  plots  the  runtime 
of  systematic  enumeration  of  all  schedules  of  the  workloads  identified  by  the  horizontal 
axis.  The  workloads  whose  schedules  could  be  exhaustively  enumerated  in  24  hours 
are  identified  by  green  color,  while  the  workloads  whose  schedules  could  not  be 
exhaustively enumerated in 24 hours are identified by red color. The dotted horizontal
line identifies one machine day and the solid horizontal line identifies one machine
year of the combined Top500 supercomputers [102]. The graph shows results for 95 out
of the 96 workloads for which Parrot uses only one schedule. Notably, the graph is
missing  a  result  for  the  SPLASH-2x  volrend  workload.  This  workload  causes  dBug  to 
run  out  of  memory  trying  to  represent  a  single  branch  of  the  execution  tree  with  more 
than  57M  nodes. 

The results for the first group of workloads offer two insights. First, using Parrot
increases the number of workloads whose schedules can be exhaustively enumerated
from 46 to 96. Second, the state space reduction realized by combining dBug with Parrot
ranges up to 10^100,000 and can be expected to be even more dramatic for programs with
larger inputs and more threads. This reduction greatly surpasses all previous methods
for state space reduction of systematic testing tools [51, 67].

Figure 6.11: Actual and Estimated Runtime of Systematic Tests

Workload        Schedules (dBug)     Schedules (dBug + Parrot)    Time
partition       1.37 × 10^7          8,194                        307s
partial_sort    1.37 × 10^7          8,194                        307s
nth_element     1.35 × 10^7          8,224                        309s
pfscan          2.43 × 10^2117       32,268                       1,201s
redis           1.26 × 10^8          9.11 × 10^7                  -
aget            2.05 × 10^17         5.11 × 10^10                 -
fmm             1.25 × 10^78         2.14 × 10^54                 -
cholesky        1.81 × 10^371        5.99 × 10^152                -
fluidanimate    2.72 × 10^218        2.64 × 10^218                -
openldap        2.40 × 10^2795       5.70 × 10^1048               -
raytrace        1.08 × 10^13863      3.68 × 10^13755              -

Table 6.1: Estimated Runtime of Systematic Tests

Multiple  Schedules  Workloads 

The  second  group  of  workloads  requires  dBug  to  be  used  for  testing  as  the  workloads 
from  this  group  can  exercise  any  of  the  multiple  schedules  allowed  by  Parrot.  Each 
workload  from  this  group  was  tested  using  dBug  both  with  and  without  Parrot  to 
quantify and contrast the extent of state space explosion. In particular, dBug repeatedly
executed each workload until it enumerated all of its schedules or a 24-hour timeout
was reached, in which case dBug's state space estimation based on the lazy
strategy, the weighted-backtrack estimator, and the empty fit (cf. Chapter 4) was used
to estimate the time needed to explore all possible schedules.

Table 6.1 details the individual results for all 12 workloads from the second group
except for the NPB ua workload, for which dBug runs out of memory trying to represent
a single branch of the execution tree with more than 127M nodes. The first column identifies
the workload, the second column identifies the estimated number of schedules when
the workload is run without Parrot, the third column identifies the estimated number
of schedules when the workload is run with Parrot, and the fourth column reports the
time it took dBug to exhaustively enumerate all schedules allowed by Parrot, where the
value "-" indicates that dBug timed out after 24 hours.

The  results  for  the  second  group  of  workloads  show  that  using  Parrot  helps  dBug 
exhaustively  enumerate  schedules  for  4  of  the  12  workloads.  For  another  5  of  the 
remaining  8  workloads,  using  Parrot  reduces  the  estimated  number  of  schedules  by 
many  orders  of  magnitude  but  the  resulting  state  space  is  still  too  large  to  be  fully 
explored  by  one  machine  in  24  hours.  In  the  case  of  multithreaded  workloads,  the 
large  number  of  schedules  stems  from  nondeterminism  in  scheduling  intra-process 
synchronizations  within  performance-critical  sections.  In  the  case  of  multiprocess 
workloads,  the  large  number  of  schedules  stems  from  nondeterminism  in  scheduling 
inter-process communication.

All in all, combining restricted runtime scheduling with systematic testing greatly
reduces the number of schedules systematic testing needs to check, increasing the
number of workloads dBug can exhaustively check from 46 to 100 (out of 108).
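
The reduction factors implied by Table 6.1 involve numbers far beyond the range of double-precision floating point, so they are easier to compare in scientific notation. The C sketch below, whose type and function names are hypothetical, divides two estimates given as a mantissa and a decimal exponent and normalizes the result; it is a convenience for reading the table, not part of dBug's estimation.

```c
#include <assert.h>

/* A count written as mantissa × 10^exponent, e.g. 1.25 × 10^78. */
struct sci {
    double mantissa;
    int exponent;
};

/* Returns a / b in normalized scientific notation (1 <= mantissa < 10). */
struct sci reduction_factor(struct sci a, struct sci b) {
    assert(a.mantissa > 0.0 && b.mantissa > 0.0);
    struct sci r = { a.mantissa / b.mantissa, a.exponent - b.exponent };
    while (r.mantissa >= 10.0) {
        r.mantissa /= 10.0;
        r.exponent++;
    }
    while (r.mantissa < 1.0) {
        r.mantissa *= 10.0;
        r.exponent--;
    }
    return r;
}
```

For example, the fmm estimates of 1.25 × 10^78 and 2.14 × 10^54 give a reduction factor of about 5.8 × 10^23, whereas the redis estimates of 1.26 × 10^8 and 9.11 × 10^7 give a factor of only about 1.4.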


6.4  Related  Work 

DMT  and  SMT  Systems 

Unlike Parrot, several prior systems are not backward-compatible because they
require new hardware [42], a new language [23], or a new programming model and OS [7].
Among backward-compatible systems, some DMT systems, including Kendo [109],
CoreDet [15], and CoreDet-related systems [16, 75], improve performance by
balancing each thread's load with low-level instruction counts. As mentioned in the
introduction of this chapter, DMT systems benefit behavior replication but do not solve
the testing problem.

Prior  deterministic  SMT  systems  reduce  the  testing  effort  in  addition  to  providing 
determinism.  However,  they  lack  a  mechanism  through  which  developers  could  tune 
their  performance  and  require  either  heavyweight  techniques  or  unrealistic  assumptions. 
Grace  [18]  requires  fork-join  parallelism.  Tern  [41]  and  Peregrine  [40]  record  and 
reuse  schedules.  Execution  recording  can  slow  down  the  execution  by  an  order  of 
magnitude,  and  computing  schedules  from  recorded  executions  relies  on  sophisticated 
source  code  analysis.  DThreads  [17]  stabilizes  schedules  by  ignoring  load  imbalance 
among  threads,  so  it  is  prone  to  the  serialization  problem  explained  in  Section  6.1. 

State  Space  Reduction 

Combining dBug with Parrot greatly reduces the number of schedules that need to be
checked. As such, this approach bears similarity to state space reduction techniques [51,
61, 67], which soundly reduce the state space to mitigate the state space explosion
problem of model checking. Partial order reduction [51, 61] has been the main reduction
technique for systematic testing tools [137, 160], including dBug, and is discussed in
detail in Chapter 2 of this thesis. Recently, researchers have proposed dynamic interface
reduction  [67]  that  checks  loosely  coupled  components  separately,  avoiding  expensive 
global  exploration  of  all  components.  However,  this  technique  has  yet  to  be  shown  to 
work  well  for  tightly  coupled  components  such  as  threads  frequently  communicating 
via  synchronizations  and  shared  memory. 

Using  restricted  runtime  scheduling  offers  three  advantages  over  previous  reduction 
techniques: (1) it is conceptually simpler because it does not rely on behavioral equivalence
to reduce the state space; (2) in the absence of performance-critical sections and
inter-process synchronizations, it remains effective as the checked system scales; and
(3) it is orthogonal to existing state space reduction techniques [51, 61, 67] used
in  systematic  testing.  Thus,  it  can  be  combined  with  existing  reduction  techniques  to 
reduce  the  state  space  caused  by  nondeterministic  SMT. 

Concurrency

Automatic  mutual  exclusion  (AME)  assumes  all  shared  memory  is  implicitly  protected 
and  allows  advanced  developers  the  flexibility  to  remove  protection.  It  thus  shares  a 
similar high-level philosophy with restricted runtime scheduling. The difference is that,
unlike  restricted  runtime  scheduling,  AME  has  only  been  implemented  in  simulation. 

Restricted  runtime  scheduling  is  orthogonal  to  much  prior  work  on  concurrency  error 
detection  [46, 96, 128, 169, 170],  diagnosis  [114, 115, 130],  and  correction  [78,  79, 157, 158]. 
By  reducing  the  number  of  schedules,  restricted  runtime  scheduling  potentially  benefits 
all  of  these  techniques. 


6.5  Conclusions 

In  conclusion,  this  chapter  presented  an  integration  of  restricted  runtime  scheduling 
with  systematic  testing.  To  this  end,  dBug  was  integrated  with  Parrot,  an  implementation 
of  restricted  runtime  scheduling  for  POSIX-compliant  operating  systems.  Parrot  offers  a 
new  contract  to  developers:  by  default,  it  schedules  synchronizations  using  round-robin, 
greatly  reducing  the  number  of  possible  schedules;  when  the  default  schedules  are 
slow,  it  allows  developers  to  use  performance  hints  to  improve  performance  by  allowing 
additional  schedules  to  be  used.  This  contract  benefits  testing  that  only  needs  to  focus 
on  schedules  allowed  by  Parrot.  This  benefit  has  been  demonstrated  by  using  dBug  to 
thoroughly  check  Parrot's  schedules.  Results  on  a  diverse  set  of  108  programs  show 
using  Parrot  reduces  the  testing  effort  required  of  dBug  by  many  orders  of  magnitude. 


Chapter 7

Abstraction  Reduction 


Abstraction  is  an  age-old  technique  used  in  formal  verification  [38]  to  combat  state 
space  explosion.  This  chapter  demonstrates  that  lifting  the  default  abstraction  of  dBug 
to  match  higher-level  coordination  interfaces  can  be  used  to  reduce  the  number  of 
schedules  systematic  testing  needs  to  check. 

The  reduction  is  realized  by  focusing  only  on  the  ordering  of  the  higher-level 
coordination  primitives  instead  of  the  ordering  of  the  lower-level  POSIX  interface 
primitives  that  implement  the  higher-level  coordination  primitives.  In  other  words, 
abstraction  reduction  focuses  systematic  testing  on  checking  whether  a  concurrent 
program  uses  a  higher-level  coordination  interface  correctly,  assuming  or  independently 
testing  that  the  implementation  of  the  interface  matches  its  specification. 

To  evaluate  the  benefits  of  abstraction  reduction,  dBug  was  extended  with  support 
for  intercepting  and  modeling  events  of  the  message-passing  interface  (MPI)  [103], 
an  interface  widely  used  in  the  high-performance  computing  community.  Programs 
from  the  NAS  Parallel  Benchmarks  (NPB)  suite  [8]  were  then  used  to  quantify  and 
contrast  the  number  of  schedules  dBug  needs  to  check  with  and  without  abstraction 
reduction  and  how  varying  the  coordination  interface  used  for  implementing  program 
functionality  between  OpenMP  [111]  and  MPI  affects  its  testability. 

The  rest  of  this  chapter  is  organized  as  follows.  Section  7.1  provides  the  background 
necessary  for  understanding  abstraction  reduction.  Section  7.2  describes  the  design 
and  implementation  of  systematic  testing  of  MPI  through  dBug.  Section  7.3  evaluates 
our  implementation,  contrasting  the  number  of  schedules  of  both  OpenMP  and  MPI 
implementations  of  programs  from  the  NPB  benchmark  suite  [8].  Section  7.4  discusses 
related  work  and  Section  7.5  draws  conclusions. 


7.1  Background 

The  main  advantage  of  using  POSIX  interface  [119]  as  the  default  abstraction  for 
modeling  executions  of  concurrent  programs  is  that  it  enables  systematic  testing  of  a 
wide  range  of  programs  across  different  programming  languages. 

Figure 7.1: Interposition based on Default Abstraction

Figure 7.2: Interposition based on Custom Abstraction


The  main  drawback  of  the  default  abstraction  is  that  if  a  program  uses  a  higher- 
level  interface,  such  as  Apache  Thrift  [5],  Java  [6],  MPI  [103],  or  Python  [149],  that  is 
itself  implemented  using  the  POSIX  interface,  then  systematic  testing  based  on  the 
default  abstraction  will  control  and  order  scheduling  nondeterminism  in  both  the 
program  and  the  implementation  of  the  higher-level  interface,  a  scenario  illustrated 
in  Figure  7.1.  For  higher-level  interfaces  with  complex  implementations,  even  simple 
programs  can  experience  considerable  state  space  explosion.  For  example,  using  the 
mpich  implementation  of  the  MPI  specification,  dBug  equipped  with  dynamic  partial 
order  reduction  estimated  the  number  of  interleavings  of  concurrent  POSIX  interface 
invocations  for  a  trivial  MPI  program,  which  uses  two  processes  and  contains  no  logic 
besides MPI initialization and finalization of both processes, to be on the order of 10^107.
This is more than the estimated number of atoms in the universe.
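
To give a rough sense of how quickly such counts grow, the following Python sketch counts the interleavings of threads under a simplifying assumption: every thread executes a fixed sequence of events and every interleaving is feasible. The function name is illustrative, and this calculation is not how dBug computes its estimates (cf. Chapter 4); it only shows the combinatorial growth.

```python
from math import factorial


def count_interleavings(events_per_thread):
    """Count interleavings of fixed per-thread event sequences.

    Assumes no event ever blocks, so the count is the multinomial
    coefficient (k_1 + ... + k_m)! / (k_1! * ... * k_m!).
    """
    total = factorial(sum(events_per_thread))
    for k in events_per_thread:
        total //= factorial(k)
    return total


if __name__ == "__main__":
    # Two threads with an increasing number of events each.
    for k in (5, 20, 50):
        n = count_interleavings([k, k])
        print(f"2 threads x {k} events: {len(str(n))} decimal digits")
```

Even with only two threads, the count grows exponentially in the number of events per thread, which is why interposing on every POSIX call made inside a complex library quickly becomes intractable.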

To  avoid  the  state  space  explosion  stemming  from  the  scheduling  nondeterminism  in 
the  implementation  of  a  higher-level  interface,  one  can  assume  that  the  implementation 
of  the  higher-level  interface  is  correct.  Equipped  with  this  assumption,  systematic 
testing  tools  can  intercept  and  model  invocations  of  the  higher-level  interface  primitives, 
abstracting away the details of the interface implementation, a scenario illustrated
in  Figure  7.2.  The  benefit  of  using  a  custom  abstraction  lies  in  reducing  the  number 
of  schedules  systematic  testing  needs  to  consider,  focusing  systematic  testing  on  the 
interactions of a program with the higher-level interface as opposed to interactions of an
implementation of the higher-level interface with the POSIX interface. The drawback
of  using  a  custom  abstraction  is  the  effort  required  to  provide  support  for  systematic 
testing  of  the  higher-level  interface  [147, 155]. 


Figure 7.3: Different Representations of Interface Behaviors


Notably,  modeling  a  specification  of  an  interface  instead  of  testing  its  implementation 
is  prone  to  human  error.  As  Figure  7.3  suggests,  a  specification,  an  implementation,  and 
a  model  each  define  a  set  of  behaviors  that  the  interface  can  exhibit.  Ideally,  these  three 
definitions  match. 

In  practice,  however,  for  reasons  such  as  performance  or  backward  compatibility, 
implementations  may  choose  to  exclude  behaviors  included  in  the  specification  or 
include  behaviors  excluded  from  the  specification.  For  example,  the  specification  of 
the pthread_cond_signal() function mandates that if any threads are blocked on
the specified condition variable, the function unblocks at least one of the threads. To
achieve fairness, an implementation of the condition variables interface may choose to
wake  up  the  thread  that  has  been  waiting  the  longest.  However,  for  some  programs,  this 
implementation  might  allow  only  a  subset  of  the  behaviors  allowed  by  the  specification. 

Systematic testing based on abstraction creates its own model of an interface, defining
a set of behaviors that an interface can exhibit, and uses this model to represent
concrete program executions that use an implementation of the interface. If
the model is inaccurate, the inaccuracies can manifest as behaviors allowed by the
model but not by the specification or vice versa. For example, the specification of the
pthread_barrier_wait() function mandates that when the required number of
threads  have  called  the  function,  a  non-zero  constant  will  be  returned  to  one  of  the 
threads  and  zero  will  be  returned  to  all  remaining  threads.  If  a  model  does  not  capture 
this  nondeterminism,  its  application  may  fail  to  detect  program  errors. 
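
The barrier example can be made concrete with a small Python sketch of a model that enumerates every return-value assignment the specification permits. The function name and the placeholder constant are assumptions made for illustration; the actual value of PTHREAD_BARRIER_SERIAL_THREAD is platform-defined, and this is not the dBug model itself.

```python
# Placeholder for the non-zero constant returned to exactly one thread;
# the real value is platform-defined.
PTHREAD_BARRIER_SERIAL_THREAD = -1


def barrier_outcomes(waiting_threads):
    """Enumerate the return-value assignments allowed by the specification.

    Each outcome maps a thread identifier to its return value: one thread
    receives the serial constant and all remaining threads receive zero.
    """
    outcomes = []
    for serial in waiting_threads:
        outcomes.append({
            t: (PTHREAD_BARRIER_SERIAL_THREAD if t == serial else 0)
            for t in waiting_threads
        })
    return outcomes


if __name__ == "__main__":
    for outcome in barrier_outcomes(["t1", "t2", "t3"]):
        print(outcome)
```

A model that always returned the serial constant to, say, the last arriving thread would explore only one of these outcomes and could miss errors that depend on which thread takes the serial branch.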


7.2 Methods


To  explore  the  idea  of  abstraction  reduction,  dBug  was  extended  with  support  for 
the  message-passing  interface  (MPI)  [103].  In  particular,  the  dBug  interposition  layer 
was  modified  to  intercept  invocations  of  MPI  primitives  and  the  dBug  arbiter  was 
modified  to  model  and  schedule  execution  of  these  invocations.  The  remainder  of  this 
section  first  describes  the  design  considerations  related  to  implementing  support  for 
systematic  testing  of  MPI  in  dBug  (§7.2.1)  and  then  highlights  notable  aspects  of  our 
implementation  of  this  design  (§7.2.2). 

7.2.1  Design 

An  important  design  decision  to  consider  when  extending  dBug  with  support  for 
MPI,  or  any  other  interface,  is  whether  to  use  dBug  only  as  a  scheduler,  delegating 
interposed  events  to  an  existing  implementation  of  the  interface,  or  whether  to  use  dBug 
to  implement  the  interface  program  logic,  replacing  an  existing  implementation.  The 
advantage  of  delegation  is  that  it  avoids  1)  the  effort  needed  to  implement  the  interface 
program  logic  inside  of  dBug  and  2)  the  risk  of  inaccurately  mapping  the  specification 
of  the  interface  to  an  implementation.  The  disadvantage  of  delegation  is  that  it  may 
limit  the  control  dBug  has  over  scheduling  of  program  threads  and  the  outcome  of  the 
interposed  events.  The  delegation  and  replacement  approaches  to  implementing  MPI 
program  logic  are  depicted  in  Figure  7.4  (a)  and  Figure  7.4  (b)  respectively. 

Our experience with implementing support for the POSIX interface in dBug suggests that
delegation is appropriate for program logic that does not affect thread scheduling, such as
I/O operations, but program logic that affects thread scheduling should be implemented
in dBug. For instance, dBug implements the program logic for synchronization events
such as waitpid() or epoll_wait(), while execution of non-synchronization events,
such as fork() or send(), is delegated to the libc library [92].

A  challenge  for  dBug's  support  for  MPI  is  that,  as  a  programming  convenience,  many 
MPI  primitives  mix  synchronization  and  non-synchronization  program  logic.  On  one 
hand,  these  primitives  should  be  implemented  by  dBug  to  retain  control  over  scheduling 
of  program  threads.  On  the  other  hand,  having  to  implement  non-synchronization 
logic,  such  as  data  movement  and  transformation  for  the  rich  set  of  MPI  data  types 
and  operations,  in  order  to  systematically  test  concurrency  seems  counter-intuitive  and 
unnecessarily  elaborate. 

To  resolve  this  problem,  our  design  chooses  to  layer  dBug  on  top  of  an  existing 
implementation  of  the  MPI  specification.  However,  instead  of  simply  delegating  complex 
MPI  primitives  that  mix  synchronization  and  non-synchronization  program  logic  to 
the  existing  implementation,  our  design  decomposes  these  primitives  into  a  set  of 
synchronization  and  non-synchronization  operations.  The  synchronization  operations 
are  then  implemented  in  dBug,  while  the  non-synchronization  operations  are  delegated 
to  the  existing  implementation.  The  decomposition  approach  to  implementing  MPI 
program  logic  is  depicted  in  Figure  7.4  (c). 

Figure 7.4: Different Approaches to Implementing MPI Program Logic: (a) Delegation, (b) Replacement, (c) Decomposition


7.2.2  Implementation 

To  enable  systematic  testing  of  MPI,  the  dBug  interposition  layer  intercepts  invocations 
of MPI primitives and the dBug arbiter models and schedules execution of these invocations.
In particular, the global state maintained by dBug contains objects representing
MPI  message  queues,  collective  communication  operations,  and  MPI  process  ranks.  The 
remainder  of  this  section  describes  notable  aspects  of  our  implementation.  A  detailed 
list  of  MPI  primitives  that  dBug  interposes  on  can  be  found  in  Appendix  B. 


Deterministic  Process  Naming 

To  facilitate  communication  between  different  processes  of  an  MPI  program,  the  MPI 
specification  defines  process  ranks  that  are  used  to  identify  processes  in  the  context  of  a 
communicator.  Communicators  provide  scope  for  group  communication  and  prevent 
communication  conflicts.  The  MPI  specification  defines  a  default  global  communicator 
that can be used to identify and communicate with any process spawned by the same MPI
program  launch. 

When an MPI program starts, each of its processes calls the MPI_Init() function
that  acts  as  a  barrier  and,  among  other  things,  assigns  the  calling  process  a  global 
communicator  process  rank.  This  process  rank  is  then  typically  used  to  identify  what 
program  logic  a  particular  MPI  process  should  execute  (cf.  Figure  3.3). 

To  enable  deterministic  replay,  dBug  assigns  deterministic  thread  identifiers  to 
threads  when  they  are  created  and  uses  these  identifiers  to  identify  threads  across 
different  executions.  Since  the  program  logic  of  MPI  processes  is  generally  determined 
by  their  global  process  rank,  it  is  important  for  the  mapping  between  dBug  thread 
identifiers  and  MPI  process  ranks  to  stay  the  same  across  different  executions. 

However, the global process ranks are assigned by the MPI_Init() function, which
cannot be decomposed into simpler MPI primitives, so our design requires that its invocations
be delegated to an existing implementation of MPI. Fortunately, the global
communicator that the MPI_Init() function creates can be overridden. In particular, after
the MPI_Init() function assigns the original global process ranks, dBug creates a
separate global communicator, with ranks ordered according to its deterministic thread
identifiers, and uses this communicator in place of the default one.
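
The effect of the rank remapping can be sketched in Python without an MPI runtime. The sketch assumes the new ranks follow the ordering rule of MPI_Comm_split (lower key first, ties broken by original rank) with the deterministic thread identifier as the key; whether dBug builds its communicator this way is an assumption, and the function name is hypothetical.

```python
def remap_ranks(det_ids_by_original_rank):
    """Compute the new rank of every process from its deterministic id.

    det_ids_by_original_rank[r] is the deterministic identifier of the
    process that MPI_Init() assigned original rank r. New ranks are
    ordered by deterministic id, with ties broken by original rank.
    """
    order = sorted(
        range(len(det_ids_by_original_rank)),
        key=lambda r: (det_ids_by_original_rank[r], r),
    )
    new_rank = [None] * len(order)
    for position, original in enumerate(order):
        new_rank[original] = position
    return new_rank


if __name__ == "__main__":
    # Two executions in which MPI_Init() assigned original ranks differently
    # to the same two processes (deterministic ids 0 and 1).
    print(remap_ranks([0, 1]))  # process with id 0 had original rank 0
    print(remap_ranks([1, 0]))  # process with id 0 had original rank 1
```

In both executions the process with deterministic identifier 0 ends up with new rank 0, so rank-dependent program logic is bound to the same thread across executions.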

Nondeterministic  Message  Matching 

One  of  the  sources  of  nondeterminism  in  the  MPI  specification  is  the  point-to-point 
communication  between  MPI  processes.  To  receive  a  message  from  a  particular  sender, 
an MPI process may call the MPI_Recv() function, identifying the sender from which
to  receive  a  message.  However,  this  function  also  allows  a  wildcard  in  place  of  a  sender 
identifier,  expressing  the  intent  to  receive  a  message  from  any  sender.  Although  the 
MPI  specification  mandates  that  messages  sent  from  the  same  sender  are  received  in  the 
order  they  were  sent,  messages  sent  from  different  senders  can  be  received  in  any  order. 

To  account  for  this  nondeterminism,  dBug  keeps  track  of  the  content  of  all  MPI 
message queues and the test function component (cf. Section 3.2.3) of the MPI_Recv()
function  model  computes  all  possible  matches  of  a  wildcard  invocation.  When  the  dBug 
arbiter  makes  a  scheduling  decision  involving  a  wildcard  invocation,  the  intended  match 
is  communicated  back  to  the  dBug  interposition  layer.  The  dBug  interposition  layer 
then delegates invocation of the MPI_Recv() function to an existing implementation
of MPI, replacing the wildcard with the identifier of the match.
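
The match computation described above can be sketched as follows. The data layout (a mapping from sender rank to a queue of message tags in send order), the ANY_SOURCE stand-in, and the function names are assumptions made for illustration and do not reflect dBug's internal structures.

```python
from collections import deque

ANY_SOURCE = None  # hypothetical stand-in for the MPI wildcard sender


def possible_matches(queues, source, tag):
    """Return the senders that a receive with this source and tag may match.

    queues maps a sender rank to a deque of pending message tags in send
    order. For a wildcard receive, every sender with a pending message
    carrying the requested tag is a candidate.
    """
    candidates = list(queues.keys()) if source is ANY_SOURCE else [source]
    return [s for s in sorted(candidates) if tag in queues.get(s, ())]


def deliver(queues, sender, tag):
    """Remove the earliest pending message with this tag from one sender."""
    queues[sender].remove(tag)  # deque.remove drops the first occurrence


if __name__ == "__main__":
    queues = {0: deque([7]), 1: deque([7, 7]), 2: deque([9])}
    # A wildcard receive for tag 7 can be matched with sender 0 or sender 1;
    # a scheduler would explore one branch per candidate.
    for sender in possible_matches(queues, ANY_SOURCE, 7):
        print("branch: replace wildcard with sender", sender)
```

Removing the first matching occurrence within a single sender's queue preserves the per-sender ordering that the specification mandates, while the choice among senders is the nondeterminism that has to be enumerated.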

Non-blocking  Communication  Primitives 

In  addition  to  traditional  blocking  communication,  the  MPI  specification  facilitates 
latency  hiding  by  offering  non-blocking  versions  of  all  communication  primitives.  For 
example, the MPI_Irecv() function is a non-blocking version of the MPI_Recv()
function.  Instead  of  blocking  until  a  message  is  received,  it  returns  immediately, 
providing  a  handle  that  can  be  used  to  check  for  completion. 

To  retain  control  over  scheduling  of  MPI  operations  and  matching  of  MPI  messages, 
dBug  decomposes  the  program  logic  of  non-blocking  communication  primitives.  When 
the  dBug  interposition  layer  intercepts  an  invocation  of  a  non-blocking  communication 
primitive,  it  internally  spawns  a  thread  that  waits  to  be  scheduled  by  the  dBug  arbiter 
and  then  invokes  a  blocking  version  of  the  communication  primitive.  Once  the  child 
thread  is  spawned,  the  parent  thread  returns  a  handle  that  can  be  used  to  check  whether 
the  internally  spawned  thread  has  terminated. 
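
A minimal Python sketch of this decomposition is shown below. It uses a threading.Event as a stand-in for the arbiter's scheduling decision; the Handle class and the nonblocking() helper are hypothetical names, not part of dBug or MPI.

```python
import threading


class Handle:
    """Completion handle backed by the internally spawned thread."""

    def __init__(self, thread, result):
        self._thread = thread
        self._result = result

    def test(self):
        """Non-blocking check: has the helper thread terminated?"""
        return not self._thread.is_alive()

    def wait(self):
        """Blocking check: wait for the helper thread and return its result."""
        self._thread.join()
        return self._result[0]


def nonblocking(blocking_call, scheduled, *args):
    """Turn a blocking call into a non-blocking one.

    The helper thread first waits until it is scheduled (here: until the
    event is set) and only then invokes the blocking call.
    """
    result = [None]

    def worker():
        scheduled.wait()
        result[0] = blocking_call(*args)

    thread = threading.Thread(target=worker)
    thread.start()
    return Handle(thread, result)


if __name__ == "__main__":
    go = threading.Event()
    handle = nonblocking(lambda x: x + 1, go, 41)
    print("completed before scheduling:", handle.test())
    go.set()  # the scheduler lets the helper thread run
    print("result:", handle.wait())
```

Because the blocking call runs only after the helper thread has been scheduled, the point at which the operation takes effect remains under the control of the scheduler rather than of the underlying library.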

The MPI specification offers both blocking and non-blocking primitives for checking
whether an invocation of a non-blocking communication primitive has completed.
The MPI_Wait(), MPI_Waitall(), MPI_Waitany(), and MPI_Waitsome() functions
are blocking and the MPI_Test(), MPI_Testall(), MPI_Testany(), and
MPI_Testsome() functions are their non-blocking counterparts. In general, the functions
that check for completion can return one of several values and the corresponding
test function components of the dBug arbiter model this nondeterminism and enumerate
all possibilities.


Collective  Communication  Primitives 

Besides  point-to-point  communication  primitives,  the  MPI  standard  also  specifies  a 
number  of  convenience  primitives  for  collective  communication.  These  primitives 
operate  over  a  collection  of  processes  and  represent  typical  communication  patterns. 
Based  on  the  communication  pattern,  these  primitives  can  be  divided  into  four  groups: 

•  all-to-all  primitives  move  data  between  every  pair  of  processes  involved  in  the 
communication 

•  all-to-one  primitives  move  data  from  all  processes  involved  in  the  communication 
to  one  distinct  process 

•  one-to-all  primitives  move  data  from  one  distinct  process  to  all  processes  involved 
in  the  communication 

•  scan  primitives  move  data  between  processes  involved  in  the  communication  in 
the  increasing  order  of  their  process  rank 

As  far  as  synchronization  is  concerned,  the  all-to-all  primitives  are  effectively  a 
barrier  and  the  scan  primitives  are  a  serialization.  In  other  words,  scheduling  of  these 
primitives  is  deterministic.  In  contrast  to  that,  the  all-to-one  and  one-to-all  primitives 
allow a number of scheduling scenarios. For all-to-one primitives the only requirement
is  that  the  process  receiving  data  returns  after  all  other  processes  invoke  the  primitive, 
while  for  one-to-all  primitives  the  only  requirement  is  that  the  process  sending  data 
invokes  the  primitive  before  all  other  processes  return.  Consequently,  all-to-one  and 
one-to-all primitives involving n processes can execute in (n - 1)! ways.
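
The (n - 1)! count can be illustrated by enumerating the orders in which the n - 1 processes other than the distinct one might be scheduled. This reading of "ways" is an assumption made for illustration, and the function name is hypothetical.

```python
from itertools import permutations
from math import factorial


def collective_orders(n, root=0):
    """Enumerate orderings of the n - 1 non-root processes of a collective."""
    others = [rank for rank in range(n) if rank != root]
    return list(permutations(others))


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        orders = collective_orders(n)
        # The number of orderings equals (n - 1)!.
        print(n, len(orders), factorial(n - 1))
```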

However,  simply  delegating  the  all-to-one  and  one-to-all  primitives  to  an  existing 
implementation  of  MPI  does  not  work.  In  general,  an  implementation  of  MPI  may 
choose to introduce additional synchronization, which prevents simple delegation from
being able to exercise all possible scenarios. To avoid this problem, the dBug interposition
layer decomposes these primitives into a collection of point-to-point communication
primitives.  The  point-to-point  communication  primitives  are  delegated  to  an  existing 
implementation  of  MPI,  while  dBug  implements  the  necessary  synchronization. 


7.3  Evaluation 

To  evaluate  the  benefits  of  using  a  custom  abstraction  of  a  higher-level  interface,  our 
implementation  of  support  for  MPI  in  dBug  was  evaluated  using  8  benchmarks  from 
the  NAS  Parallel  Benchmarks  (NPB)  [8]  suite. 

Our benchmark selection was driven by the fact that NPB contains both MPI and
OpenMP [111] implementations. OpenMP is an API for shared-memory parallel programming
and its specification defines a collection of compiler directives that can be
used for parallel execution of otherwise sequential code. Typically, OpenMP implementations
translate these compiler directives to pthreads invocations, which enables
systematic testing of OpenMP programs in dBug. Our choice of benchmarks thus
allowed  us  to  contrast  the  results  for  custom  abstraction  of  MPI  programs  against  the 
default  abstraction  of  both  MPI  and  OpenMP  programs. 

The  benchmarks  selected  for  our  evaluation  are  summarized  in  Table  7.1.  For  each 
benchmark,  the  table  lists  an  identifier,  a  short  description,  the  programming  language 
of  the  benchmark,  and  the  lines  of  code  of  the  MPI  and  OpenMP  implementations. 


                                                           Lines of code
Identifier  Description                         Language   MPI     OpenMP
BT          block tri-diagonal solver           Fortran    9,348   5,295
CG          conjugate gradient                  Fortran    1,878   1,262
EP          embarrassingly parallel             Fortran      368     297
FT          discrete 3D fast Fourier transform  Fortran    2,172   1,206
IS          integer sort                        C          1,150   1,058
LU          lower-upper Gauss-Seidel solver     Fortran    5,797   5,403
MG          multi-grid on a sequence of meshes  Fortran    2,641   1,533
SP          scalar penta-diagonal solver        Fortran    5,036   3,385

Table  7.1:  Overview  of  NAS  Parallel  Benchmarks 


7.3.1  Experimental  Setup 

Our  evaluation  was  carried  out  using  Susitna  nodes  from  PRObE  [58].  A  Susitna  node 
is  configured  with  64  cores  (quad  socket,  2.1  GHz  16-core  AMD  Opteron),  and  128 
GB  of  memory.  Our  experiments  used  Ubuntu  12.04,  the  mpich  library  version  3.0.4 
as  the  MPI  implementation,  the  GNU  libgomp  library  version  3.1  as  the  OpenMP 
implementation,  and  NPB  version  3.3.1. 

The  NPB  benchmarks  come  with  several  input  configurations.  Our  experiments  used 
the  "S"  configuration,  representing  a  small  instance,  for  all  the  benchmarks.  Even  for 
small  instances,  the  extent  of  state  space  explosion  is  considerable  and  the  effort  needed 
to  check  large  instances  stretches  beyond  the  practical  limits  of  existing  systematic 
testing  tools.  The  BT  and  SP  benchmarks  were  run  using  four  threads  because  their 
MPI  implementation  requires  the  number  of  threads  to  be  a  square  number.  All  other 
benchmarks  were  run  with  two  threads. 

For each benchmark, dBug was used to systematically explore different interleavings
of interposed events using a number of different configurations. The OpenMP
implementation was tested using the default abstraction and the MPI implementation
was tested using both the default and custom abstraction. To evaluate the interaction
between abstraction reduction and dynamic partial order reduction (DPOR) [51], all
our experiments were carried out both with DPOR enabled and disabled. Finally,
each configuration was run in dBug until all interleavings were explored or a 24-hour
timeout was reached, in which case dBug used the lazy strategy, the weighted-backtrack
estimator, and the empty fit (cf. Chapter 4) to estimate the total runtime.

Figure 7.5: Contrasting Custom and Default Abstraction

7.3.2  Results 

The  results  of  our  experiments  are  presented  as  graphs  in  Figures  7.5  and  7.6.  The 
horizontal  axes  of  these  graphs  identify  the  benchmark  and  the  reduction  configuration. 
The  log-scale  vertical  axes  of  these  graphs  identify  the  runtime.  Red  values  represent 
runtime  estimates  for  experiments  that  timed  out  after  24  hours,  while  green  values 
represent  actual  runtimes  for  experiments  that  finished  in  less  than  24  hours. 

Figure  7.5  presents  a  comparison  of  the  default  and  custom  abstraction  of  MPI 
programs.  These  results  demonstrate  that  combining  custom  abstraction  and  DPOR 
is  able  to  reduce  the  number  of  interleavings  of  interposed  events  to  a  set  that  can  be 
exhaustively  enumerated  for  6  out  of  8  benchmarks.  As  it  turns  out,  the  semantics  of  MPI 
primitives  used  by  these  benchmarks  allow  only  interleavings  that  are  all  considered 
equivalent  by  DPOR,  thus  reducing  the  state  space  down  to  a  single  equivalence 
class.  This  compelling  result  highlights  the  synergy  between  abstraction  reduction  and 
dynamic  partial  order  reduction. 

Figure 7.6: Contrasting MPI and OpenMP

Figure  7.6  presents  a  comparison  of  the  custom  abstraction  of  MPI  programs  and 
the  default  abstraction  of  OpenMP  programs.  This  comparison  reveals  that  using  a 
higher-level  interface  to  implement  program  logic  in  combination  with  abstraction 
reduction  can  help  systematic  testing  tools  thoroughly  check  the  program  logic.  For 
OpenMP  implementations,  which  libgomp  implements  through  pthreads  primitives, 
using  dBug  with  DPOR  fails  to  avoid  state  space  explosion  of  possible  interleavings  of 
pthreads  primitives  for  any  of  the  benchmarks.  In  contrast,  for  MPI  implementations, 
using  dBug  with  abstraction  reduction  and  DPOR  avoids  state  space  explosion  of 
possible interleavings of MPI primitives for 6 out of 8 benchmarks.

7.4  Related  Work 

Abstraction 

The  seminal  abstract  interpretation  paper  by  Cousot  and  Cousot  [38]  provides  a  formal 
framework  for  describing  program  abstractions  and  establishes  conditions  under  which 
a  program  property  can  be  studied  using  a  program  abstraction.  Our  approach  to 
systematic testing of scheduling nondeterminism can be thought of as abstract interpretation
of programs. A key aspect of the abstractions used in our work is that they model
the state necessary for controlling input and scheduling nondeterminism, enabling
deterministic replay and systematic state space exploration.

The work of Kurshan [86] demonstrates how complex systems may be abstractly
modeled  as  a  collection  of  many  small  state  machines.  This  concept  is  similar  to  our 
work  that  essentially  views  each  intercepted  interface  event  as  a  simple  state  machine 
that  interacts  with  other  events  through  an  abstraction  of  the  global  state. 

Clarke et al. study abstract interpretation in the context of model checking [34]
and demonstrate that combining abstraction with symbolic representation [26] can
scale hardware verification to designs with up to 10^1300 states. In contrast to model
checking,  our  work  checks  behaviors  of  the  actual  programs  and  uses  abstraction  to 
group  interleavings  of  concrete  program  transitions  into  equivalence  classes. 


Tools 

Given the widespread popularity of MPI in the high-performance computing community,
it is not surprising that a fair amount of research has focused on helping software
engineers  to  test  and  debug  MPI  programs. 

The  Umpire  tool  [152]  uses  the  MPI  profiling  layer  to  monitor  an  execution  of 
an  MPI  program  and  check  this  execution  for  a  set  of  correctness  properties.  The 
mechanism  Umpire  uses  to  monitor  a  program  execution  is  similar  to  the  one  used 
by the dBug interposition layer. In contrast to our work, Umpire does not control
scheduling  nondeterminism  in  program  executions  or  enumerate  different  scheduling 
scenarios. 

The  MPI-SPIN  tool  [133]  is  an  MPI  extension  of  the  stateful  model  checker  SPIN  [72]. 
The  main  drawback  of  MPI-SPIN  is  that,  although  SPIN  supports  a  subset  of  C,  in 
general,  engineers  that  want  to  check  their  MPI  programs  with  MPI-SPIN  need  to  create 
Promela  [70]  models  of  their  MPI  programs. 

The  TASS  toolkit  [134]  is  a  suite  of  integrated  tools  to  model  and  formally  analyze 
parallel  programs.  Notably,  TASS  uses  symbolic  expressions  to  represent  program 
inputs  and,  similar  to  other  symbolic  execution  tools  [27,  29],  systematically  enumerates 
non-equivalent  inputs.  Compared  to  our  work,  TASS  only  supports  programs  written 
in  a  subset  of  C,  limiting  the  scope  of  its  applicability  for  testing  of  MPI  programs. 

The  ISP  tool  [118,  147,  148]  is  in  spirit  very  similar  to  dBug's  support  for  MPI.  The 
main  difference  is  that  ISP  has  only  been  demonstrated  to  work  on  toy  examples  [147] 
and  supports  a  small  subset  of  all  MPI  primitives.  Further,  its  authors  argue  that 
the  standard  happens-before  relation  [88]  and  dynamic  partial  order  reduction  [51] 
are  inapplicable  to  MPI  programs,  furnishing  ISP  with  custom  alternatives  to  these 
time-tested  techniques. 

The  DAMPI  tool  [155,  156]  is  a  successor  of  the  ISP  tool,  inheriting  most  of  its 
limitations. The main innovation of DAMPI is that it offers parallel exploration capabilities
and better, yet still non-standard, happens-before tracking. In contrast to ISP,
the evaluation of DAMPI demonstrates feasibility of systematic testing for complex
MPI programs. However, lacking the ability to change levels of abstraction or estimate
the length of systematic tests, DAMPI's evaluation focuses on measuring runtime overhead.
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7.5  Conclusions 


This  chapter  explored  the  potential  of  abstraction  to  help  mitigate  state  space  explosion 
stemming  from  scheduling  nondeterminism.  The  key  insight  of  this  chapter  is  that  the 
boundary  between  a  program  and  its  environment  separates  the  program  logic  to  be 
tested  by  systematic  testing  from  the  program  logic  of  library  services  that  are  assumed 
to  be  independently  tested.  Using  this  insight,  this  chapter  studied  the  effects  of  treating 
an  implementation  of  a  higher-level  inter-process  communication  interface  as  a  part  of 
the  environment  as  opposed  to  a  part  of  the  program. 

In particular, this chapter explored this idea by implementing and evaluating
abstraction reduction for the message-passing interface (MPI) [103], an inter-process
communication interface widely used in the high-performance computing community.
To this end, our systematic testing tool dBug was extended with support for intercepting
and modeling events at the MPI level of abstraction. The NAS Parallel Benchmarks
(NPB) suite [8] was then used to contrast the extent of state space explosion for systematic
testing based on the POSIX interface abstraction with that based on the MPI abstraction.
In addition, both MPI and OpenMP implementations of NPB benchmarks were considered,
enabling a comparison between different programming abstractions.

Our  experimental  results  confirmed  our  intuition  that  using  abstraction  helps  to 
mitigate  state  space  explosion  stemming  from  scheduling  nondeterminism.  What  was 
surprising  was  the  extent  of  the  realized  reduction;  for  the  benchmarks  considered 
in our evaluation, the reduction ranged between 10^106 and 10^1773. Our evaluation also
generated  evidence  of  synergy  between  abstraction  reduction  and  dynamic  partial 
order  reduction  [51].  Namely,  the  combination  of  these  two  reduction  techniques 
correctly  infers  that  for  6  out  of  8  of  the  benchmarks  considered  by  our  evaluation,  all 
interleavings  of  the  interposed  events  can  be  grouped  into  a  single  equivalence  class. 
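
To give a sense of the scale involved, the following Python sketch counts the interleavings of a set of threads and groups the interleavings of a small example into equivalence classes. It is an illustration only: the event encoding and the dependence relation (two events are dependent when they belong to the same thread or touch the same resource) are simplifying assumptions rather than the relation used by dBug, and the brute-force enumeration is not dynamic partial order reduction itself.

```python
from math import factorial


def count_interleavings(lengths):
    """Number of interleavings of threads that execute lengths[i] events each."""
    total = factorial(sum(lengths))
    for n in lengths:
        total //= factorial(n)
    return total


def label(threads):
    """Tag each event with (thread id, position, resource) so events are unique."""
    return [[(tid, pos, res) for pos, res in enumerate(events)]
            for tid, events in enumerate(threads)]


def interleavings(threads):
    """Yield every interleaving that preserves each thread's program order."""
    if not any(threads):
        yield []
        return
    for i, events in enumerate(threads):
        if events:
            rest = threads[:i] + [events[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [events[0]] + tail


def dependent(a, b):
    """Assumed dependence relation: same thread or same resource."""
    return a[0] == b[0] or a[2] == b[2]


def equivalence_classes(threads):
    """Count classes of interleavings that order every dependent pair alike."""
    classes = set()
    for run in interleavings(label(threads)):
        key = frozenset((a, b)
                        for i, a in enumerate(run)
                        for b in run[i + 1:]
                        if dependent(a, b))
        classes.add(key)
    return len(classes)
```

For instance, count_interleavings([2, 2]) evaluates to 6, while two threads of 50 events each already admit roughly 10^29 interleavings. When the two threads touch disjoint resources, as in equivalence_classes([["x", "x"], ["y", "y"]]), all six interleavings fall into a single class; two threads that each operate once on the same resource, as in equivalence_classes([["m"], ["m"]]), yield two classes.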
Chapter  8
Related  Work 


While  the  previous  chapters  discussed  work  closely  related  to  the  research  undertaken 
by  each  chapter,  this  chapter  provides  a  broader  view,  helping  to  place  the  sum  of 
the  research  carried  out  by  this  thesis  in  the  context  of  related  work.  To  this  end,  the 
subject  of  this  thesis  is  characterized  as  "systematic  testing  of  distributed  and  multithreaded 
programs  using  dynamic  analysis  and  stateless  exploration  to  check  for  safety  properties" 
and  this  characterization  is  used  to  identify  dimensions  for  comparison  to  related  work. 


8.1  Testing  /  Offline  Debugging  /  Online  Debugging 

An  important  concern  of  distributed  and  multithreaded  programs,  not  addressed  in  this 
thesis,  is  how  to  handle  faults  that  are  detected  when  the  program  is  already  deployed. 
A key requirement for methods that handle such faults is that they add only negligible overhead
to  the  runtime  of  a  program.  Offline  debugging  meets  this  requirement  by  using  a 
lightweight  mechanism  for  recording  a  compact  log  of  key  execution  events.  The  idea 
behind  logging  these  events  is  to  enable  a  deterministic  replay  and  analysis  of  possibly 
faulty  executions  offline.  For  instance,  the  WiDS  checker  [94]  achieves  deterministic 
replay  through  program  annotation,  while  the  Friday  tool  [54]  uses  the  liblog  [55] 
infrastructure  to  provide  for  deterministic  replay  of  unmodified  distributed  programs. 
More  recently,  Altekar  et  al.  [3]  have  shown  how  to  enable  deterministic  replay  for 
multiprocessor  architectures. 

In  contrast  to  offline  debugging,  online  debugging  analyzes  the  execution  of  a 
program  while  the  program  is  running.  For  instance,  the  D3S  [93]  project  uses  binary 
instrumentation  in  Box  [68]  to  extend  legacy  distributed  and  multithreaded  programs 
with  filters  that  stream  execution  information  to  a  background  checker  that  monitors 
correct  execution  of  the  program  and  can  check  for  system-wide  invariants. 

In  comparison  to  debugging,  the  methods  and  tools  presented  in  this  thesis  are 
intended  to  be  used  for  in-house  software  testing  before  the  software  is  deployed. 
Notably,  instead  of  simply  monitoring  an  execution,  systematic  testing  methods  strive 
to  achieve  good  coverage  by  systematically  exploring  different  behaviors  of  the  system. 
8.2  Distributed  /  Multithreaded  /  Sequential  Programs


Advances  in  testing  and  verification  of  sequential  programs  permeate  the  history  of 
computer  science.  Early  work  by  King  [83]  identified  the  value  of  symbolic  execution 
as a mechanism for systematically generating test cases that explore distinct branches
and paths of a computer program. This mechanism prompted the creation of symbolic
evaluation tools such as Dissect as early as 1977 [74]. In recent years, researchers
have  exploited  advancements  in  automated  constraint  solving  and  proven  automated 
high-coverage  testing  of  unmodified  sequential  programs  to  be  not  only  feasible,  but 
also  practical  [27,  29,  64,  131]. 

Systematic testing of multithreaded programs was pioneered by Godefroid in his
work on VeriSoft [61, 62]. Interestingly, VeriSoft also contained experimental support
for remote processes that communicated with a centralized scheduler using a local
proxy process. To the best of our knowledge, this constitutes the first attempt at
systematic testing of distributed and multithreaded programs. More recently, the CHESS
tool [107] implemented systematic testing for unmodified multithreaded
Windows-specific programs.

Though in theory both multithreaded and distributed programs are concurrent
systems,  in  practice  the  methods  for  systematic  testing  of  each  are  far  from  identical. 
First,  unlike  a  multithreaded  program  that  runs  in  the  context  of  a  single  process  and 
operating  system,  a  distributed  program  generally  has  no  centralized  entity  that  would 
already  hold  a  global  view  of  the  program  state.  Second,  multithreaded  programs 
typically coordinate via shared state, while distributed programs often coordinate via
both  shared  state  and  message  passing.  The  latter  difference  plays  an  important  role  in 
methods  for  systematic  testing  of  unmodified  programs,  which  in  the  case  of  distributed 
programs  need  to  account  for  additional  communication  and  coordination  primitives. 

Systematic  testing  of  unmodified  distributed  programs  was  first  investigated  in  detail 
by  the  MaceMC  [82]  tool.  The  limitation  of  MaceMC  is  that  it  only  handles  programs 
written  in  the  Mace  [81]  programming  language.  More  recently,  the  MoDist  [160]  tool 
extended  the  same  concept  to  unmodified  legacy  distributed  programs  written  for  the 
Windows  platform.  The  MoDist  tool  relies  on  automated  instrumentation  of  Windows 
API  [68]  and  uses  this  instrumentation  to  explore  different  execution  orders,  inject  faults, 
and  simulate  timeouts. 

The  methods  and  tools  presented  in  this  thesis  push  the  limits  of  previous  approaches 
in  several  directions.  State  space  estimation  enables  quantification  of  systematic  testing 
progress  and  coverage,  parallel  state  space  exploration  scales  systematic  testing  to  large 
computational  clusters,  and  restricted  runtime  scheduling  and  abstraction  reductions 
scale  systematic  testing  to  more  complex  programs.  Further,  dBug  enables  systematic 
testing irrespective of programming languages and instruction sets. The only prerequisite
of dBug is that the underlying system is POSIX-compliant and supports runtime
interposition.  Once  this  prerequisite  is  met,  one  can  use  dBug  to  systematically  test 
unmodified  binaries  of  any  program. 
8.3  Dynamic  /  Static  Analysis

In  general,  certain  programming  concepts  such  as  pointer  arithmetic,  function  pointers, 
or  library  calls,  all  of  which  are  used  frequently  in  systems  code,  represent  obstacles 
for  the  use  of  static  analysis  tools.  These  obstacles  are  typically  overcome  by  making 
assumptions about the programming language and constructs of the system, the
environment in which the system runs, or both. Unlike static analysis tools, dynamic
analysis tools do not make assumptions about either the system or the environment.
Instead, the dynamic analysis runs the actual system within some specific environment.
The advantage of a dynamic analysis is that it observes an actual behavior of the
system. The disadvantage is that its observations are made in the context of a specific
environment. 

Static  analysis  of  programs  for  concurrency  faults  has  been  an  active  area  of  research 
for  decades.  A  notable  recent  research  effort  produced  the  HAVOC  tool  [87]  for 
automatic  detection  of  faults  in  multithreaded  C  programs.  To  combat  the  state  space 
explosion,  the  exploration  algorithm  of  HAVOC  explores  possible  context-switching 
scenarios  only  to  a  certain  depth  -  a  technique  known  as  context-bounding  [106].  The  tool 
has  been  used  to  find  faults  in  Windows  device  drivers.  Interestingly,  the  experiments 
required the creation of a model of the Windows operating system environment.
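
The effect of context-bounding on the number of schedules can be illustrated with a small Python sketch. This is not HAVOC's algorithm, which is a static analysis; the sketch merely enumerates schedules dynamically for threads described by hypothetical step counts and discards any schedule that exceeds a given number of preemptions. A preemption is assumed here to be a switch away from a thread that still has steps left to run.

```python
def bounded_schedules(steps, bound):
    """Enumerate schedules of threads with steps[i] steps each, using at most
    `bound` preemptions. A schedule is a tuple of thread ids, one per step."""
    results = []

    def extend(remaining, current, used, prefix):
        if not any(remaining):
            results.append(tuple(prefix))
            return
        for thread, left in enumerate(remaining):
            if left == 0:
                continue
            # Switching away from a thread that could still run is a preemption;
            # switching after the current thread finished is free.
            preempts = (current is not None and thread != current
                        and remaining[current] > 0)
            cost = used + (1 if preempts else 0)
            if cost > bound:
                continue
            after = list(remaining)
            after[thread] -= 1
            extend(after, thread, cost, prefix + [thread])

    extend(list(steps), None, 0, [])
    return results
```

For two threads of two steps each, bounded_schedules([2, 2], 0) yields 2 schedules, a bound of 1 yields 4, and a bound of 2 yields all 6 interleavings, which shows how a small bound prunes the space at the cost of completeness.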

Besides  detecting  faults,  static  analyses  are  also  used  to  prevent  faults  by  providing 
compile  time  guarantees.  For  example,  Engler  et  al.  [47]  have  demonstrated  how  to  write 
system-specific  checks  that  prevent  certain  types  of  faults  from  happening.  Reflecting 
on  their  experience  a  decade  later  [19],  the  authors  identified  the  ability  to  avoid  false 
positives  as  the  most  important  step  towards  practicality  of  static  analysis. 

The  methods  and  tools  presented  in  this  thesis  refrain  from  the  use  of  static  analysis 
in  order  to  avoid  false  positives  and  programming  language  dependency. 


8.4  Stateless  /  Stateful  Search 

All  systematic  testing  tools  need  a  mechanism  for  navigating  the  space  of  possible 
program  states.  While  going  forward  can  be  achieved  by  simply  executing  instructions 
of  the  program,  going  backward  is  not  as  straightforward  and  one  needs  to  store  the 
program  state.  In  principle,  this  can  be  achieved  either  by  storing  the  state  explicitly 
(stateful search), or implicitly as a sequence of nondeterministic choices leading to the
state from some previous state (stateless search), trading space for time. A separate
concern  of  the  stateless  approach  is  the  need  to  reconstruct  the  initial  state. 

Examples  of  methods  implementing  the  stateful  search  include  the  C  Model  Checker 
(CMC)  [105],  the  FiSC  tool  [165],  or  the  KLEE  tool  [29].  These  tools  store  a  portion 
of  the  in-memory  and  on-disk  state  so  that  they  can  later  reconstruct  this  state.  In 
contrast  to  that,  the  stateless  search  was  first  proposed  in  VeriSoft  [61]  and  later  adopted 
by  the  CHESS  [107]  verification  tool.  Both  of  these  tools  assume  the  existence  of  an 
initialization  function  that  reconstructs  the  initial  state. 




The  advantage  of  the  stateful  approach  is  the  speed  of  state  reconstruction  at  the 
expense  of  the  space  needed  to  store  the  information  needed  for  the  reconstruction.  The 
advantage  of  the  stateless  approach  is  the  small  space  requirement  needed  to  store  a 
state  at  the  cost  of  needing  to  replay  part  of  the  execution.  Notably,  stateful  searches 
typically  first  run  out  of  memory,  while  stateless  searches  typically  first  run  out  of  time. 
Finally,  some  tools  such  as  the  SPIN  model  checker  [71]  try  to  balance  the  time  and 
space  requirements  by  combining  the  stateful  and  the  stateless  approaches. 

Similar  to  VeriSoft,  the  methods  presented  in  this  thesis  use  a  stateless  exploration 
and  assume  existence  of  a  mechanism  that  sets  up  the  initial  exploration  state. 
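
A minimal Python sketch of stateless exploration follows. It is not the dBug implementation: the callbacks make_initial_state, enabled, and step, as well as the toy two-thread counter program, are hypothetical names introduced for illustration. The point it shows is that the search stores only sequences of choices and reaches each unexplored branch by rebuilding the initial state and replaying a prefix.

```python
def explore_stateless(make_initial_state, enabled, step):
    """Enumerate all executions while storing only sequences of choices."""
    results = []
    pending = [[]]  # choice prefixes still to be explored, never program states
    while pending:
        prefix = pending.pop()
        state = make_initial_state()      # reconstruct the initial state
        for choice in prefix:             # replay the recorded choices
            step(state, choice)
        trace = list(prefix)
        while True:
            options = enabled(state)
            if not options:
                break
            for alternative in options[1:]:
                pending.append(trace + [alternative])  # explore later
            step(state, options[0])
            trace.append(options[0])
        results.append((tuple(trace), state))
    return results


# Toy program: two threads each perform a non-atomic increment of a shared
# counter, modeled as a read step followed by a write step.
def make_initial_state():
    return {"counter": 0, "pc": [0, 0], "tmp": [0, 0]}


def enabled(state):
    return [t for t in (0, 1) if state["pc"][t] < 2]


def step(state, thread):
    if state["pc"][thread] == 0:
        state["tmp"][thread] = state["counter"]      # read
    else:
        state["counter"] = state["tmp"][thread] + 1  # write
    state["pc"][thread] += 1
```

Calling explore_stateless(make_initial_state, enabled, step) visits all 6 interleavings of the toy program. In 4 of them both reads precede both writes, so the final counter is 1 rather than 2, which a safety check on the final state would report.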


8.5  Safety  /  Liveness  /  Data  Race  Checkers 

Informally,  a  safety  property  is  defined  to  be  a  property  that  can  be  violated  by  a  finite 
execution  of  a  program,  while  a  liveness  property  is  defined  to  be  a  property  that  can 
only  be  violated  by  an  infinite  execution  of  a  program.  A  data  race  occurs  when  two  or 
more  threads  perform  non-commutative  data  operations  concurrently.  Although  many 
data  races  are  intentional  or  benign,  some  data  races  can  corrupt  the  program  state  or 
lead  to  a  crash  [122]. 

With  the  exception  of  MaceMC  [82],  execution-based  checkers  search  for  violation  of 
safety  properties  only.  The  reason  for  this  is  that  the  checking  of  an  unmodified  legacy 
program  for  liveness  properties  is  generally  undecidable  [121].  Although  there  has  been 
some  progress  on  automated  proving  of  liveness  properties  [36,  37],  the  state  of  the  art 
methods  do  not  scale  to  the  size  and  complexity  of  unmodified  legacy  programs. 

Given  the  wide-spread  occurrence  of  data  races,  a  number  of  tools  focus  on  detection, 
avoidance, and prediction of data races. Notable early work on data race
detection  includes  Mellor-Crummey's  on-the-fly  detection  of  data  races  in  fork-join 
programs  [101]  and  Eraser  [128],  a  tool  for  detecting  data  races  in  legacy  programs. 
Further  research  in  this  space  produced  both  static  [46]  and  dynamic  [25,  50,  169] 
methods  for  detecting  data  races.  Many  of  these  tools  make  use  of  the  happens-before 
relation  [88],  a  powerful  mechanism  for  reasoning  about  concurrent  program  events.  The 
improvements in data race detection tools have tended to reduce both the false positive rate
and  the  runtime  overhead.  Recently,  a  light-weight  form  of  memory  access  sampling  [48] 
has  been  shown  to  be  effective  at  detecting  data  races  in  the  Windows  operating  system 
kernel. 
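
The role of the happens-before relation in dynamic data race detection can be sketched in Python using vector clocks. The sketch below is a simplified illustration and does not reproduce any of the cited tools: the trace format is hypothetical, only lock acquire and release operations create cross-thread ordering, and a race is taken to be a pair of accesses to the same variable by different threads, at least one of them a write, that the relation leaves unordered.

```python
def find_races(trace, num_threads):
    """Return index pairs of conflicting accesses unordered by happens-before.

    trace is a list of (thread, op, target) tuples in observed order, where
    op is one of "read", "write", "acquire", or "release".
    """
    clocks = [[0] * num_threads for _ in range(num_threads)]
    for t in range(num_threads):
        clocks[t][t] = 1
    released = {}  # lock -> vector clock at its most recent release
    accesses = []  # (trace index, thread, variable, is_write, clock snapshot)
    races = []
    for index, (thread, op, target) in enumerate(trace):
        vc = clocks[thread]
        if op == "acquire":
            # Join with the clock of the last release of this lock, if any.
            for i, value in enumerate(released.get(target, [])):
                vc[i] = max(vc[i], value)
        elif op == "release":
            released[target] = list(vc)
            vc[thread] += 1
        else:
            is_write = op == "write"
            for prev_index, prev_thread, var, prev_write, snapshot in accesses:
                conflicting = (var == target and prev_thread != thread
                               and (is_write or prev_write))
                # The earlier access happens before this one if this thread
                # has already observed the earlier thread's clock component.
                ordered = snapshot[prev_thread] <= vc[prev_thread]
                if conflicting and not ordered:
                    races.append((prev_index, index))
            accesses.append((index, thread, target, is_write, list(vc)))
    return races
```

Two unsynchronized writes, find_races([(0, "write", "x"), (1, "write", "x")], 2), are reported as the race (0, 1). If each write is wrapped in an acquire and release of the same lock, the release-to-acquire edge orders the writes and no race is reported.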

The  methods  and  tools  presented  in  this  thesis  focus  on  safety  properties  and,  for  the 
reasons  discussed  in  Chapter  3,  dBug  considers  program  transitions  at  the  granularity  of 
function  call  interleavings.  Consequently,  the  methods  and  tools  presented  in  this  thesis 
focus  on  coarse-grained  concurrency  errors  such  as  deadlocks  or  incorrect  uses  of  an 
API.  Fortunately,  many  of  the  existing  data  race  checkers  [25,  128,  169]  are  compatible 
with systematic testing of dBug and can be used to increase its precision.




Chapter  9 
Conclusions 


This  thesis  makes  two  important  points:  1)  it  demonstrates  the  practicality  of  the 
systematic  approach  to  testing  of  concurrent  programs,  and  2)  it  quantifies  and  advances 
the  practical  limits  of  systematic  testing  of  concurrent  programs. 

The  practicality  of  the  systematic  testing  approach  is  demonstrated  by  presenting 
two  tools.  ETA  [138]  is  a  tool  that  targets  systematic  testing  of  components  of  the  Omega 
cluster  management  system  [129],  while  dBug  [137]  is  a  tool  that  targets  systematic 
testing  of  unmodified  binaries  of  concurrent  programs  written  for  POSIX-compliant 
operating  systems.  These  tools  were  used  to  systematically  test  over  100  different 
concurrent  programs  written  in  a  number  of  languages,  including  C,  C++,  and  Fortran, 
spanning  a  range  of  programming  paradigms,  such  as  actors  [2],  pthreads  [120], 
OpenMP [111], and MPI [103].

To  measure  and  advance  the  practical  limits  of  systematic  testing,  this  thesis  presents 
novel  results  in  several  research  directions  ranging  from  state  space  estimation  to 
parallel  state  space  exploration  to  state  space  reduction. 

To  measure  the  extent  of  state  space  explosion,  this  thesis  pioneers  research  on  state 
space  estimation  of  systematic  tests,  presenting  a  number  of  techniques  that  can  be  used 
for  predicting  the  length  of  a  systematic  test.  The  techniques  are  independent  of  the 
exploration  algorithm  and  thus  compatible  with  a  wide  range  of  existing  systematic 
testing  tools.  The  techniques  have  been  implemented  in  both  ETA  and  dBug  and  used 
to  estimate  the  extent  of  state  space  explosion  for  over  100  different  programs.  The 
evaluation  of  these  techniques  demonstrated  that  the  techniques  achieve  good  accuracy 
and that they can be used to drive efficient allocation of testing resources.
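
As a rough illustration of how the size of an execution tree can be estimated without exploring it fully, the Python sketch below implements Knuth's classic random-probe estimator. This is a swapped-in textbook technique and not necessarily one of the estimators developed in this thesis; the children callback and the tuple-based node encoding used in the example are hypothetical.

```python
import random


def estimate_leaves(children, root, probes, rng):
    """Estimate the number of leaves (complete executions) of a tree.

    Each probe walks from the root to a leaf, choosing a child uniformly at
    random, and multiplies the branching factors seen along the way. The
    average over all probes is an unbiased estimate of the leaf count.
    """
    total = 0
    for _ in range(probes):
        node, weight = root, 1
        while True:
            kids = children(node)
            if not kids:
                break
            weight *= len(kids)
            node = rng.choice(kids)
        total += weight
    return total / probes


def binary_tree_children(node):
    """Example tree: a complete binary tree of depth 3, nodes encoded as paths."""
    if len(node) == 3:
        return []
    return [node + (0,), node + (1,)]
```

On a tree whose branching factor is the same for all nodes at a given depth, such as the example above, every probe returns the exact leaf count, so estimate_leaves(binary_tree_children, (), 10, random.Random(0)) evaluates to 8.0. On irregular trees individual probes vary, and accuracy depends on the number of probes.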

To  speed  up  long-running  systematic  tests,  this  thesis  improves  on  previous  work  on 
efficient  state  space  exploration  algorithms,  enabling  systematic  testing  on  large  scale 
computational  clusters.  In  particular,  our  scalable  implementation  of  dynamic  partial 
order  reduction  [139],  a  state  of  the  art  algorithm  for  efficient  state  space  exploration, 
has  been  demonstrated  to  achieve  strong  scaling  on  a  cluster  of  over  1,000  machines, 
exploring  millions  of  different  test  executions  in  a  matter  of  minutes. 

To mitigate state space explosion for multithreaded programs, this thesis combines
systematic testing with restricted runtime scheduling. To this end, dBug has been
integrated with Parrot, an implementation of restricted runtime scheduling for
POSIX-compliant operating systems [39]. The integration takes advantage of the modular design of
both  of  the  tools  and  is  accomplished  through  a  simple  coordination  API.  The  end  result 
demonstrates  the  synergy  between  systematic  testing  and  restricted  runtime  scheduling. 
Systematic  testing  checks  the  schedules  allowed  by  restricted  runtime  scheduling,  while 
restricted  runtime  scheduling  reduces  the  number  of  schedules  systematic  testing  needs 
to  check.  Arguably,  the  state  space  reduction  accomplished  by  combining  systematic 
testing  with  restricted  runtime  scheduling  represents  a  leap  forward  in  combating  the 
state  space  explosion  problem.  Notably,  unlike  previous  work  [51,  67],  this  reduction 
technique  scales  to  many  threads  and  long  test  executions. 

To  mitigate  state  space  explosion  for  multiprocess  programs,  this  thesis  makes  use 
of  abstraction  and  demonstrates  how  high-level  inter-process  coordination  APIs  such 
as  MPI  [103]  can  be  used  to  soundly  abstract  executions  of  multiprocess  programs, 
reducing the number of abstract program states that need to be examined. To
demonstrate the practical potential of reduction through abstraction, dBug has been extended
with  support  for  MPI.  Our  evaluation  shows  that  abstraction  reduction  is  orthogonal 
to  other  reduction  techniques,  such  as  dynamic  partial  order  reduction  [51],  and  their 
combination  helps  to  thoroughly  test  multiprocess  programs. 

As  for  future  work,  there  are  several  directions  in  which  the  research  exploration 
carried  out  by  this  thesis  could  be  followed  up.  First,  given  the  value  provided  by  state 
space  estimates,  it  would  be  interesting  to  see  if  their  accuracy  can  be  improved  by 
adopting  a  machine  learning  approach  that  learns  the  typical  structure  of  execution  trees 
of  real-world  programs  over  time.  Second,  given  the  benefit  of  combining  systematic 
testing  with  restricted  runtime  scheduling  of  intra-process  synchronizations,  it  would  be 
interesting  to  see  whether  a  similar  path  could  be  taken  for  restricted  runtime  scheduling 
of inter-process synchronizations. The main challenge of this research direction is
the  creation  of  a  practical  runtime  for  deterministic  and  stable  multiprocessing.  Third, 
given  the  wide  spectrum  of  possible  execution  tree  exploration  strategies,  it  would  be 
interesting  to  carry  out  a  study  that  considers  a  sizable  set  of  concurrent  programs 
and  contrasts  different  exploration  strategies  using  metrics  such  as  time  and  space 
complexity,  estimate  accuracy  convergence,  and  bug  coverage. 

In  conclusion,  this  thesis  provides  strong  evidence  in  support  of  its  statement 
that  existing  concurrent  software  can  be  tested  in  a  scalable  and  systematic  fashion 
using  testing  infrastructure  which  controls  nondeterminism  and  mitigates  state  space 
explosion. 
Appendix  A
dBug  and  POSIX 


The  following  list  enumerates  92  functions  from  the  POSIX  interface  [119]  for  which 
dBug  implements  non-trivial  handlers: 

• Barriers:

■ pthread_barrier_destroy
■ pthread_barrier_init
■ pthread_barrier_wait

• Condition Variables:

■ pthread_cond_broadcast
■ pthread_cond_destroy
■ pthread_cond_init
■ pthread_cond_signal
■ pthread_cond_timedwait
■ pthread_cond_wait

• I/O Functions:

■ accept
■ bind
■ close
■ dup
■ dup2
■ epoll_create
■ epoll_create1
■ epoll_ctl
■ epoll_wait
■ fcntl
■ listen
■ open
■ pipe
■ poll
■ read
■ readv
■ recv
■ recvfrom
■ send
■ sendto
■ select
■ socket
■ socketpair
■ unlink
■ write
■ writev

• Mutexes:

■ pthread_mutex_destroy
■ pthread_mutex_init
■ pthread_mutex_lock
■ pthread_mutex_timedlock
■ pthread_mutex_trylock
■ pthread_mutex_unlock

• Non-reentrant Functions:

■ gethostbyaddr
■ gethostbyname
■ inet_ntoa
■ strtok

• Processes:

■ execl
■ execv
■ execle
■ execve
■ execlp
■ execvp
■ exit
■ _exit
■ fork
■ posix_spawn
■ posix_spawnp
■ setpgid
■ setpgrp
■ wait
■ waitpid

• Read-write Locks:

■ pthread_rwlock_destroy
■ pthread_rwlock_init
■ pthread_rwlock_rdlock
■ pthread_rwlock_tryrdlock
■ pthread_rwlock_timedrdlock
■ pthread_rwlock_timedwrlock
■ pthread_rwlock_trywrlock
■ pthread_rwlock_unlock
■ pthread_rwlock_wrlock

• Scheduling and Time:

■ clock
■ gettimeofday
■ nanosleep
■ sleep
■ usleep
■ sched_yield
■ time

• Semaphores:

■ sem_close
■ sem_destroy
■ sem_init
■ sem_open
■ sem_post
■ sem_unlink
■ sem_wait

• Spin Locks:

■ pthread_spin_destroy
■ pthread_spin_init
■ pthread_spin_lock
■ pthread_spin_trylock
■ pthread_spin_unlock

• Threads:

■ pthread_cancel
■ pthread_create
■ pthread_detach
■ pthread_exit
■ pthread_join
Appendix  B
dBug  and  MPI 


The following list enumerates 58 functions from the MPI standard [10 ] for which dBug
implements non-trivial handlers:

• Collective Communication:

■ MPI_Allgather
■ MPI_Allgatherv
■ MPI_Allreduce
■ MPI_Alltoall
■ MPI_Alltoallv
■ MPI_Alltoallw
■ MPI_Barrier
■ MPI_Bcast
■ MPI_Exscan
■ MPI_Gather
■ MPI_Gatherv
■ MPI_Reduce
■ MPI_Reduce_scatter
■ MPI_Reduce_scatter_block
■ MPI_Scan
■ MPI_Scatter
■ MPI_Scatterv

• Communicator Management:

■ MPI_Cart_sub
■ MPI_Comm_compare
■ MPI_Comm_create_group
■ MPI_Comm_dup
■ MPI_Comm_dup_with_info
■ MPI_Comm_free
■ MPI_Comm_group
■ MPI_Comm_split
■ MPI_Comm_split_type

• Non-Blocking Point-To-Point Communication:

■ MPI_Ibsend
■ MPI_Imrecv
■ MPI_Improbe
■ MPI_Iprobe
■ MPI_Irecv
■ MPI_Irsend
■ MPI_Isend
■ MPI_Issend
■ MPI_Request_free
■ MPI_Test
■ MPI_Test_all
■ MPI_Test_any
■ MPI_Test_some
■ MPI_Wait
■ MPI_Wait_all
■ MPI_Wait_any
■ MPI_Wait_some

• Point-To-Point Communication:

■ MPI_Bsend
■ MPI_Buffer_detach
■ MPI_Mprobe
■ MPI_Mrecv
■ MPI_Probe
■ MPI_Recv
■ MPI_Rsend
■ MPI_Send
■ MPI_Send_recv
■ MPI_Send_recv_replace
■ MPI_Ssend

• Process Management:

■ MPI_Finalize
■ MPI_Init
■ MPI_Init_thread